In this paper we study the ergodicity properties of some adaptive
Markov chain Monte Carlo algorithms (MCMC) that have been
recently proposed in the literature. We prove that under a set of 
verifiable conditions, ergodic averages calculated from the output of 
a so-called adaptive MCMC sampler converge to the required value 
and can even, under more stringent assumptions, satisfy a central 
limit theorem. We prove that the conditions required are satisfied 
for the independent Metropolis-Hastings algorithm and the random 
walk Metropolis algorithm with symmetric increments. Finally, we 
propose an application of these results to the case where the pro- 
posal distribution of the Metropolis-Hastings update is a mixture of 
distributions from a curved exponential family. 



1. Introduction. Markov chain Monte Carlo (MCMC), introduced by
Metropolis et al. [27], is a popular computational method for generating
samples from virtually any distribution $\pi$. In particular, there is no need
for the normalizing constant to be known, and the space
$\mathsf{X} \subset \mathbb{R}^{n_x}$ (for some integer $n_x$) on which it is
defined can be high dimensional. We will hereafter denote by
$\mathcal{B}(\mathsf{X})$ the associated countably-generated $\sigma$-field.
The method consists of simulating an ergodic Markov chain $\{X_k, k \ge 0\}$
on $\mathsf{X}$ with transition probability $P$ such that $\pi$ is a
stationary distribution for this chain, that is, $\pi P = \pi$. Such samples
can be used, for example, to compute integrals of the form
\[ \pi(f) := \int_{\mathsf{X}} f(x)\,\pi(dx) \]
for some $\pi$-integrable function $f : \mathsf{X} \to \mathbb{R}$, using
estimators of the type
\[ S_n(f) = \frac{1}{n} \sum_{k=1}^{n} f(X_k). \tag{1} \]

In general, the transition probability $P$ of the Markov chain depends on
some tuning parameter, say $\theta$, defined on some space
$\Theta \subset \mathbb{R}^{n_\theta}$ for some integer $n_\theta$, and the
convergence properties of the Monte Carlo averages in equation (1) might
strongly depend on a proper choice of these parameters.

We illustrate this here with the Metropolis-Hastings (MH) update, but it 
should be stressed at this point that the results presented in this paper apply 
to much more general settings (including, in particular, hybrid samplers, 
sequential or population Monte Carlo samplers). The MH algorithm requires
the choice of a proposal distribution $q$. In order to simplify the
discussion, we will here assume that $\pi$ and $q$ admit densities with
respect to the Lebesgue measure $\lambda^{\mathrm{Leb}}$, denoted (with an
abuse of notation) $\pi$ and $q$ hereafter. The role of the distribution $q$
consists of proposing potential transitions for the Markov chain $\{X_k\}$.
Given that the chain is currently at $x$, a candidate $y$ is accepted with
probability $\alpha(x,y)$, defined as
\[ \alpha(x,y) = \begin{cases} 1 \wedge \dfrac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}, & \text{if } \pi(x)\,q(x,y) > 0, \\ 1, & \text{otherwise}, \end{cases} \]
where $a \wedge b = \min(a,b)$. Otherwise it is rejected, and the Markov
chain stays at its current location $x$. For
$(x,A) \in \mathsf{X} \times \mathcal{B}(\mathsf{X})$, the transition kernel
$P$ of this Markov chain takes the form



\[ P(x,A) = \int_{A-x} \alpha(x,x+z)\,q(x,x+z)\,\lambda^{\mathrm{Leb}}(dz) + \mathbf{1}_A(x) \int_{\mathsf{X}-x} \bigl(1-\alpha(x,x+z)\bigr)\,q(x,x+z)\,\lambda^{\mathrm{Leb}}(dz), \tag{2} \]
where $A - x := \{z \in \mathsf{X},\ x + z \in A\}$. The Markov chain $P$ is
reversible with respect to $\pi$ and therefore admits $\pi$ as an invariant
distribution. Conditions on the proposal distribution $q$ that guarantee
irreducibility and positive recurrence are mild, and many satisfactory
choices are possible. For the purpose of illustration, we concentrate in
this introduction on the symmetric increments random walk MH algorithm
(hereafter SRWM), in which $q(x,y) = q(y - x)$ for some symmetric
probability density $q$ on $\mathbb{R}^{n_x}$, referred to as the increment
distribution. The transition kernel $P_q^{\mathrm{SRW}}$ of the Metropolis
algorithm is then given for
$(x,A) \in \mathsf{X} \times \mathcal{B}(\mathsf{X})$ by
\[ P_q^{\mathrm{SRW}}(x,A) = \int_{A-x} \Bigl(1 \wedge \frac{\pi(x+z)}{\pi(x)}\Bigr) q(z)\,\lambda^{\mathrm{Leb}}(dz) + \mathbf{1}_A(x) \int_{\mathsf{X}-x} \Bigl(1 - 1 \wedge \frac{\pi(x+z)}{\pi(x)}\Bigr) q(z)\,\lambda^{\mathrm{Leb}}(dz). \tag{3} \]
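As an illustration of the SRWM kernel and of the estimator in (1), the following is a minimal one-dimensional sketch. It assumes a Gaussian increment distribution with a fixed standard deviation `step`; the function names `srwm_chain` and `ergodic_average` are hypothetical and chosen only for this example.

```python
import math
import random


def srwm_chain(log_target, x0, step, n, rng):
    """Run n steps of a 1-D symmetric random walk Metropolis sampler.

    log_target: log of the (possibly unnormalized) target density.
    step: standard deviation of the Gaussian increment distribution q.
    """
    x, log_px = x0, log_target(x0)
    chain = []
    for _ in range(n):
        y = x + rng.gauss(0.0, step)  # candidate y = x + z, with z ~ q
        log_py = log_target(y)
        # Accept with probability min(1, pi(y)/pi(x)); q cancels by symmetry.
        if rng.random() < math.exp(min(0.0, log_py - log_px)):
            x, log_px = y, log_py
        chain.append(x)  # on rejection the chain stays at x
    return chain


def ergodic_average(f, chain):
    """Estimator S_n(f) = (1/n) * sum_k f(X_k) of equation (1)."""
    return sum(f(x) for x in chain) / len(chain)
```

Only the ratio of target densities enters the acceptance step, which is why the normalizing constant of the target is never needed.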

A classical choice for $q$ is the multivariate normal distribution with zero
mean and covariance matrix $\Gamma$, $\mathcal{N}(0, \Gamma)$. We will later
on refer to this algorithm as the N-SRWM. It is well known that either too
small or too large a covariance matrix will result in highly positively
correlated Markov chains and therefore estimators $S_n(f)$ with a large
variance. Gelman, Roberts and Gilks [16] have shown that the "optimal"
covariance matrix (under restrictive technical conditions not given here)
for the N-SRWM is $(2.38^2/n_x)\Gamma_\pi$, where $\Gamma_\pi$ is the true
covariance matrix of the target distribution. In practice, this covariance
matrix $\Gamma$ is determined by trial and error, using several realizations
of the Markov chain. This hand-tuning requires some expertise and can be
time consuming. In order to circumvent this problem, Haario, Saksman and
Tamminen [19] have proposed to "learn $\Gamma$ on the fly." The Haario,
Saksman and Tamminen [19] algorithm can be summarized as follows:

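\[ \begin{aligned} \mu_{k+1} &= \mu_k + \gamma_{k+1}(X_{k+1} - \mu_k), \qquad k \ge 0, \\ \Gamma_{k+1} &= \Gamma_k + \gamma_{k+1}\bigl((X_{k+1} - \mu_k)(X_{k+1} - \mu_k)^{\mathsf{T}} - \Gamma_k\bigr), \end{aligned} \tag{4} \]
where, denoting by $\mathcal{C}_+^{n_x}$ the cone of positive
$n_x \times n_x$ matrices:

• $X_{k+1}$ is drawn from $P_{\theta_k}(X_k, \cdot)$, where for
$\theta = (\mu,\Gamma) \in \Theta = \mathbb{R}^{n_x} \times \mathcal{C}_+^{n_x}$,
$P_\theta := P^{\mathrm{SRW}}_{\mathcal{N}(0,\lambda\Gamma)}$ is here the
kernel of a symmetric random walk MH with a Gaussian increment distribution
$\mathcal{N}(0,\lambda\Gamma)$, $\lambda > 0$ being a constant scaling
factor depending only on the dimension of the state space $n_x$ and kept
constant across the iterations;

• $\{\gamma_k\}$ is a nonincreasing sequence of positive stepsizes such that
$\sum_{k=1}^{\infty} \gamma_k = \infty$ and
$\sum_{k=1}^{\infty} \gamma_k^{1+\delta} < \infty$ for some $\delta > 0$
(Haario, Saksman and Tamminen [19] have suggested the choice
$\gamma_k = 1/k$).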
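The recursion (4) can be sketched in dimension one as follows. This is only an illustration under stated assumptions: the scaling constant is set to $2.38^2$, the stepsizes are taken as $1/(k + \text{offset})$ with a hypothetical `offset` so that no single early step overwrites the initial variance, and no stabilization device is included. The function name `adaptive_srwm_1d` is not from the paper.

```python
import math
import random


def adaptive_srwm_1d(log_target, x0, mu0, var0, n, rng,
                     scale=2.38 ** 2, offset=10):
    """1-D sketch of recursion (4): adapt (mu, Gamma) while sampling.

    scale plays the role of the constant lambda; offset is an illustrative
    choice giving the stepsizes gamma_k = 1 / (k + offset).
    """
    x, log_px = x0, log_target(x0)
    mu, var = mu0, var0
    for k in range(1, n + 1):
        # Draw the next state from the SRWM kernel with N(0, scale * var)
        # increments, using the current parameter value.
        y = x + rng.gauss(0.0, math.sqrt(scale * var))
        log_py = log_target(y)
        if rng.random() < math.exp(min(0.0, log_py - log_px)):
            x, log_px = y, log_py
        step = 1.0 / (k + offset)
        # Both updates use the previous mean, as in (4).
        diff = x - mu
        mu = mu + step * diff
        var = var + step * (diff * diff - var)
    return x, mu, var
```

Because each stepsize is strictly below one, the variance update is a convex combination of a positive number and a nonnegative one, so the proposal variance stays positive in this sketch.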

It was realized in [4] that such a scheme is a particular case of a more
general framework. More precisely, for $\theta = (\mu,\Gamma) \in \Theta$,
define $H : \Theta \times \mathsf{X} \to \mathbb{R}^{n_x} \times \mathbb{R}^{n_x \times n_x}$ as
\[ H(\theta, x) := \bigl(x - \mu,\ (x - \mu)(x - \mu)^{\mathsf{T}} - \Gamma\bigr)^{\mathsf{T}}. \tag{5} \]
With this notation, the recursion in (4) may be written as
\[ \theta_{k+1} = \theta_k + \gamma_{k+1} H(\theta_k, X_{k+1}), \qquad k \ge 0, \tag{6} \]
with $X_{k+1} \sim P_{\theta_k}(X_k, \cdot)$. This recursion is at the core
of most classical stochastic approximation algorithms (see, e.g.,
[9, 13, 23] and the references therein). This algorithm is designed to solve
the equations $h(\theta) = 0$, where $\theta \mapsto h(\theta)$ is the
so-called mean field defined as
\[ h(\theta) := \int_{\mathsf{X}} H(\theta,x)\,\pi(dx). \tag{7} \]
For the present example, assuming that
$\int_{\mathsf{X}} |x|^2 \pi(dx) < \infty$, one can easily check that
\[ h(\theta) = \int_{\mathsf{X}} H(\theta,x)\,\pi(dx) = \bigl(\mu_\pi - \mu,\ (\mu_\pi - \mu)(\mu_\pi - \mu)^{\mathsf{T}} + \Gamma_\pi - \Gamma\bigr)^{\mathsf{T}}, \tag{8} \]
with $\mu_\pi$ and $\Gamma_\pi$ the mean and covariance of the target
distribution, that is,
\[ \mu_\pi = \int_{\mathsf{X}} x\,\pi(dx) \quad\text{and}\quad \Gamma_\pi = \int_{\mathsf{X}} (x - \mu_\pi)(x - \mu_\pi)^{\mathsf{T}}\,\pi(dx). \tag{9} \]
One can rewrite (6) as
\[ \theta_{k+1} = \theta_k + \gamma_{k+1} h(\theta_k) + \gamma_{k+1} \xi_{k+1}, \]
where $\{\xi_k = H(\theta_{k-1}, X_k) - h(\theta_{k-1}),\ k \ge 1\}$ is
generally referred to as "the noise." The general theory of stochastic
approximation (SA) provides us with conditions under which this recursion
eventually converges to the set $\{\theta \in \Theta,\ h(\theta) = 0\}$.
These issues are discussed in Sections 3 and 5.
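The closed form (8) for the mean field can be checked numerically in dimension one. The sketch below assumes a Gaussian target from which independent draws are available (an assumption made only so that the integral in (7) can be approximated directly); the helper names are hypothetical.

```python
import random


def H(theta, x):
    """Update function (5) in dimension one, with theta = (mu, Gamma)."""
    mu, var = theta
    return (x - mu, (x - mu) ** 2 - var)


def mean_field_exact(theta, mu_pi, var_pi):
    """Closed form (8) for the mean field h(theta)."""
    mu, var = theta
    return (mu_pi - mu, (mu_pi - mu) ** 2 + var_pi - var)


def mean_field_mc(theta, draws):
    """Monte Carlo approximation of (7): average of H(theta, x) over draws."""
    n = len(draws)
    s0 = sum(H(theta, x)[0] for x in draws)
    s1 = sum(H(theta, x)[1] for x in draws)
    return (s0 / n, s1 / n)
```

The mean field vanishes exactly when the parameter equals the mean and variance of the target, which is the point the recursion (6) is designed to find.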

In the context of adaptive MCMC, the parameter convergence is not the
central issue; the focus is rather on the approximation of $\pi(f)$ by the
sample mean $S_n(f)$. However, there is here a difficulty with the adaptive
approach: as the parameter estimate $\theta_k = \theta_k(X_0,\ldots,X_k)$
depends on the whole past, the successive draws $\{X_k\}$ do not define a
homogeneous Markov chain, and standard arguments for the consistency and
asymptotic normality of $S_n(f)$ do not apply in this framework. Note that
this is despite the fact that, for any $\theta \in \Theta$,
$\pi P_\theta = \pi$. This is illustrated by the following example. Let
$\mathsf{X} = \{1,2\}$ and consider for
$\theta, \theta(1), \theta(2) \in \Theta$ the following Markov transition
probability matrices:
\[ P_\theta = \begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix}, \qquad \tilde P = \begin{pmatrix} 1-\theta(1) & \theta(1) \\ \theta(2) & 1-\theta(2) \end{pmatrix}. \]
One can check that for any $\theta \in \Theta$, $\pi = (1/2, 1/2)$ satisfies
$\pi P_\theta = \pi$. However, if we let $\theta_k$ be a given function
$\theta : \mathsf{X} \to (0,1)$ of the current state, that is,
$\theta_k = \theta(X_k)$, one defines a new Markov chain with transition
probability $\tilde P$ now having
$[\theta(2)/(\theta(1)+\theta(2)),\ \theta(1)/(\theta(1)+\theta(2))]$ as
invariant distribution. One recovers $\pi$ when the dependence on the
current state $X_k$ is removed or vanishes with the iterations. With this
example in mind, the problems that we address in the present paper and our
main general results can be summarized as follows:

1. In situations where $|\theta_{k+1} - \theta_k| \to 0$ as $k \to +\infty$
w.p. 1, we prove a strong law of large numbers for $S_n(f)$ (see Theorem 8)
under mild additional conditions. Such a consistency result may arise even
in situations where the parameter sequence $\{\theta_k\}$ does not converge.

2. In situations where $\theta_k$ converges w.p. 1, we prove an invariance
principle for $\sqrt{n}(S_n(f) - \pi(f))$; the limiting distribution is, in
general, a mixture of Gaussian distributions (see Theorem 9).
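The two-state example that precedes this list can be verified numerically. The sketch below assumes that the fixed-parameter kernel switches state with the same probability from either state, and approximates invariant distributions by repeated multiplication; the function names are hypothetical.

```python
def stationary_distribution(P, iterations=500):
    """Approximate the invariant row vector of a 2x2 stochastic matrix
    by iterating pi <- pi P, starting from the uniform distribution."""
    pi = [0.5, 0.5]
    for _ in range(iterations):
        pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
              pi[0] * P[0][1] + pi[1] * P[1][1]]
    return pi


def fixed_kernel(theta):
    """Kernel with the same switching probability theta in both states."""
    return [[1.0 - theta, theta], [theta, 1.0 - theta]]


def state_dependent_kernel(theta1, theta2):
    """Kernel obtained when the parameter depends on the current state."""
    return [[1.0 - theta1, theta1], [theta2, 1.0 - theta2]]
```

With switching probabilities 0.2 in state 1 and 0.6 in state 2, the invariant distribution moves from (1/2, 1/2) to (3/4, 1/4), even though each fixed-parameter kernel leaves (1/2, 1/2) invariant.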



Note that Haario, Saksman and Tamminen [19] have proved the consistency
of Monte Carlo averages for the algorithm described by (4). Our
results apply to more general settings and rely on assumptions which are 
less restrictive than those used in [19]. The second point above, the in- 
variance principle, has, to the best of our knowledge, not been addressed 
for adaptive MCMC algorithms. We point out that Atchade and Rosenthal 
[6] have independently extended the consistency result of Haario, Saksman 
and Tamminen [19] to the case where X is unbounded, using Haario, Saks- 
man and Tamminen's [19] mixingale technique. Our technique of proof is 
different, and our algorithm allows for an unbounded parameter $\theta$ to be
considered, as opposed to Atchade and Rosenthal [6]. 

The paper is organized as follows. In Section 2 we detail our general
procedure and introduce some notation. In Section 3 we establish the
consistency (i.e., a strong law of large numbers) for $S_n(f)$ (Theorem 8).
In Section 4 we strengthen the conditions required to ensure the law of
large numbers (LLN) for $S_n(f)$ and establish an invariance principle
(Theorem 9). In Section 5 we focus on the classical Robbins-Monro
implementation of our procedure and introduce further conditions that allow
us to prove that $\{\theta_k\}$ converges
w.p. 1 (Theorem 11). In Section 6 we establish general properties of the 
generic SRWM required to ensure an LLN and an invariance principle. For 
pedagogical purposes we show how to apply these results to the N-SRWM 
of [19] (Theorem 15). In Section 7 we present another application of our 
theory. We focus on the independent Metropolis-Hastings algorithm (IMH) 
and establish general properties required for the LLN and the invariance 
principle. We then go on to propose and analyse an algorithm that matches 
the so-called proposal distribution of the IMH to the target distribution 
$\pi$, in the case where the proposal distribution is a mixture of distributions
from the exponential family. The main result of this section is Theorem 
21. We conclude with the remark that this latter result equally applies to 
a generalization of the N-SRWM where the proposal is again a mixture of 
distributions. Application to samplers which consist of a mixture of SRWM 
and IMH is straightforward. 

2. Algorithm description and main definitions. Before describing the
procedure under study, it is necessary to introduce some notation and
definitions. Let $\mathsf{T}$ be a separable space and let
$\mathcal{B}(\mathsf{T})$ be a countably-generated $\sigma$-field on
$\mathsf{T}$. For a Markov chain with transition probability
$P : \mathsf{T} \times \mathcal{B}(\mathsf{T}) \to [0,1]$ and any
nonnegative measurable function $f : \mathsf{T} \to [0,+\infty)$, we define
$Pf(t) = P(t,f) := \int_{\mathsf{T}} P(t,dt') f(t')$ and, for any integer
$k$, denote by $P^k$ the $k$th iterate of the kernel. For a probability
measure $\mu$, we define, for any $A \in \mathcal{B}(\mathsf{T})$,
$\mu P(A) := \int_{\mathsf{T}} \mu(dt) P(t,A)$. A Markov chain on a state
space $\mathsf{T}$ is said to be $\mu$-irreducible if there exists a measure
$\mu$ on $\mathcal{B}(\mathsf{T})$ such that, whenever $\mu(A) > 0$,
$\sum_{k=0}^{\infty} P^k(t,A) > 0$ for all $t \in \mathsf{T}$. Denote by
$\mu$ a maximal irreducibility measure for $P$ (see [28], Chapter 4, for the
definition and the construction of such a measure). If $P$ is
$\mu$-irreducible, aperiodic and has an invariant probability measure $\pi$,
then $\pi$ is unique and is a maximal irreducibility measure.

Two main ingredients are required for the definition of our adaptive 
MCMC algorithms: 

1. A family of Markov transition kernels on $\mathsf{X}$,
$\{P_\theta, \theta \in \Theta\}$, indexed by a finite-dimensional parameter
$\theta \in \Theta \subset \mathbb{R}^{n_\theta}$, where $\Theta$ is assumed
to be an open set. For each $\theta$ in $\Theta$, it is assumed that
$P_\theta$ is $\pi$-irreducible and that $\pi P_\theta = \pi$, that is,
$\pi$ is the invariant distribution for $P_\theta$.

2. A family of update functions
$\{H(\theta,x) : \Theta \times \mathsf{X} \to \mathbb{R}^{n_\theta}\}$ which
are used to adapt the value of the tuning parameter.

The adaptive algorithm studied in this paper (which corresponds to the
process $\{Z_k\}$ defined below) requires for both its definition and study
the introduction of an intermediate "stopped" process, which we now define.

First, in order to take into account potential jumps outside the space
$\Theta$, we extend the parameter space with a cemetery point,
$\theta_c \notin \Theta$, and define
$\bar\Theta := \Theta \cup \{\theta_c\}$. It is convenient to introduce the
family of transition kernels $\{Q_\gamma, \gamma \ge 0\}$ such that for any
$\gamma \ge 0$, $(x,\theta) \in \mathsf{X} \times \Theta$,
$A \in \mathcal{B}(\mathsf{X})$ and $B \in \mathcal{B}(\bar\Theta)$,
\[ Q_\gamma(x,\theta; A \times B) = \int_A P_\theta(x,dy)\,\mathbf{1}\{\theta + \gamma H(\theta,y) \in B\} + \delta_{\theta_c}(B) \int_A P_\theta(x,dy)\,\mathbf{1}\{\theta + \gamma H(\theta,y) \notin \Theta\}, \tag{10} \]
where $\delta_\theta$ denotes the Dirac delta function at
$\theta \in \bar\Theta$. In its general form, the basic version of the
adaptive MCMC algorithm considered here may be written as follows. Set
$\theta_0 = \theta \in \Theta$, $X_0 = x \in \mathsf{X}$ and, for
$k \ge 0$, define recursively the sequence $\{(X_k,\theta_k), k \ge 0\}$: if
$\theta_k = \theta_c$, then set $\theta_{k+1} = \theta_c$ and
$X_{k+1} = x$; otherwise
$(X_{k+1},\theta_{k+1}) \sim Q_{\rho_{k+1}}(X_k,\theta_k;\cdot)$, where
$\boldsymbol{\rho} = \{\rho_k\}$ is a sequence of stepsizes. The sequence
$\{(X_k,\theta_k)\}$ is a nonhomogeneous Markov chain on the product space
$\mathsf{X} \times \bar\Theta$. This nonhomogeneous Markov chain defines a
probability measure on the canonical state space
$(\mathsf{X} \times \bar\Theta)^{\mathbb{N}}$ equipped with the canonical
product $\sigma$-algebra. We denote by
$\mathcal{F} = \{\mathcal{F}_k, k \ge 0\}$ the natural filtration of this
Markov chain and by $\mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}$ and
$\mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}$ the probability and the
expectation associated with this Markov chain starting from
$(x,\theta) \in \mathsf{X} \times \Theta$.

Because of the interaction with feedback between $X_k$ and $\theta_k$, the
stability of this inhomogeneous Markov chain is often difficult to
establish. This is a long-standing problem in the field of stochastic
optimization. Known practical remedies for this problem include the
reprojections on a fixed set (see [23]) and the more recent reprojection on
random varying boundaries proposed in [11, 12] and generalized in [3].

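More precisely, we first define the notion of compact coverage of $\Theta$.
A family of compact subsets $\{\mathcal{K}_q, q \ge 0\}$ of $\Theta$ is said
to be a compact coverage if
\[ \bigcup_{q \ge 0} \mathcal{K}_q = \Theta \quad\text{and}\quad \mathcal{K}_q \subset \operatorname{int}(\mathcal{K}_{q+1}), \qquad q \ge 0, \tag{11} \]
where $\operatorname{int}(A)$ denotes the interior of set $A$. Let
$\boldsymbol{\gamma} := \{\gamma_k\}$ be a monotone nonincreasing sequence
of positive numbers and let $\mathsf{K}$ be a subset of $\mathsf{X}$. For a
sequence $a = \{a_k\}$ and an integer $l$, we define the "shifted" sequence
$a^{\leftarrow l}$ as follows: for any $k \ge 1$,
$a^{\leftarrow l}_k := a_{k+l}$. Let
$\Pi : \mathsf{X} \times \Theta \to \mathsf{K} \times \mathcal{K}_0$ be a
measurable function. Define the homogeneous Markov chain
$Z_k = \{(X_k,\theta_k,\kappa_k,\nu_k), k \ge 0\}$ on the product space
$\mathsf{Z} := \mathsf{X} \times \Theta \times \mathbb{N} \times \mathbb{N}$,
with transition probability
$R : \mathsf{Z} \times \mathcal{B}(\mathsf{Z}) \to [0,1]$ algorithmically
defined as follows (note that in order to alleviate notation, the dependence
of $R$ on both $\boldsymbol{\gamma}$ and $\{\mathcal{K}_q, q \ge 0\}$ is
implicit throughout the paper). For any
$(x,\theta,\kappa,\nu) \in \mathsf{Z}$: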

1. If $\nu = 0$, then draw
$(X',\theta') \sim Q_{\gamma_\kappa}(\Pi(x,\theta);\cdot)$; otherwise, draw
$(X',\theta') \sim Q_{\gamma_{\kappa+\nu}}(x,\theta;\cdot)$.

2. If $\theta' \in \mathcal{K}_\kappa$, then set $\kappa' = \kappa$ and
$\nu' = \nu + 1$; otherwise, set $\kappa' = \kappa + 1$ and $\nu' = 0$.

In words, $\kappa$ and $\nu$ are counters: $\kappa$ is the index of the
current active truncation set and $\nu$ counts the number of iterations
since the last reinitialization. The event $\{\nu_k = 0\}$ indicates that a
reinitialization occurs. The algorithm is restarted at iteration $k$ from a
point in $\mathsf{K} \times \mathcal{K}_0$ with the "smaller" sequence of
stepsizes $\boldsymbol{\gamma}^{\leftarrow \kappa_k}$. Note the important
fact, at the heart of our analysis, that, between reinitializations, this
process coincides with the basic version of the algorithm described earlier,
with $\boldsymbol{\rho} = \boldsymbol{\gamma}^{\leftarrow \kappa_k}$. This is
formalized in Lemma 1 below.

This algorithm is reminiscent of the projection on random varying bound- 
aries proposed in [11, 12]: whenever the current iterate wanders outside the 
active truncation set, the algorithm is reinitialized with a smaller initial 
value of the stepsize and a larger truncation set. 
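One transition of the reinitialized chain can be sketched generically as below. This is a hedged illustration rather than the exact construction: the kernel, the update function, the truncation sets and the reprojection map are all passed in as hypothetical user-supplied callables, the stepsize indices reflect our reading of the two-step description, and the cemetery point is not modeled separately because any value outside the active set triggers a restart.

```python
def reprojection_step(z, gamma, sample_kernel, H, in_truncation_set,
                      reproject, rng):
    """One transition of the chain Z = (X, theta, kappa, nu).

    gamma(k): stepsize with index k.
    sample_kernel(x, theta, rng): draw from the kernel with parameter theta.
    H(theta, x): update direction, returned as a tuple.
    in_truncation_set(theta, q): True if theta lies in the q-th compact set.
    reproject(x, theta): map into the restart set (hypothetical stand-in).
    """
    x, theta, kappa, nu = z
    if nu == 0:
        # Reinitialization: restart from the reprojected point.
        x, theta = reproject(x, theta)
        step = gamma(kappa)
    else:
        step = gamma(kappa + nu)
    x_new = sample_kernel(x, theta, rng)
    theta_new = tuple(t + step * h for t, h in zip(theta, H(theta, x_new)))
    if in_truncation_set(theta_new, kappa):
        return (x_new, theta_new, kappa, nu + 1)
    # The parameter left the active set: enlarge it and flag a restart.
    return (x_new, theta_new, kappa + 1, 0)
```

The counter `kappa` only ever increases, so the number of restarts along a trajectory can be read off from its final value.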

The homogeneous Markov chain $\{Z_k, k \ge 0\}$ defines a probability
measure on the canonical state space $\mathsf{Z}^{\mathbb{N}}$ equipped with
the canonical product $\sigma$-algebra. We denote by
$\mathcal{G} = \{\mathcal{G}_k, k \ge 0\}$,
$\bar{\mathbb{P}}_{x,\theta,k,l}$ and $\bar{\mathbb{E}}_{x,\theta,k,l}$ the
filtration, probability and expectation associated with this Markov chain
started from $(x,\theta,k,l) \in \mathsf{Z}$. For simplicity, we will use
the shorthand notation $\bar{\mathbb{E}}_\star$ and $\bar{\mathbb{P}}_\star$
for $\bar{\mathbb{E}}_{x,\theta} = \bar{\mathbb{E}}_{x,\theta,0,0}$ and
$\bar{\mathbb{P}}_{x,\theta} = \bar{\mathbb{P}}_{x,\theta,0,0}$ for all
$(x,\theta) \in \mathsf{X} \times \Theta$.

These probability measures depend upon the deterministic sequence
$\boldsymbol{\gamma}$. The dependence will be implicit hereafter. We define
recursively $\{T_n, n \ge 0\}$, the sequence of successive reinitialization
times,
\[ T_{n+1} = \inf\{k \ge T_n + 1,\ \nu_k = 0\} \quad\text{with } T_0 = 0, \tag{12} \]

where, by convention, $\inf\{\emptyset\} = \infty$. It may be shown that,
under mild conditions on $\{P_\theta, \theta \in \Theta\}$,
$\{H(\theta,x), (\theta,x) \in \Theta \times \mathsf{X}\}$ and the sequence
$\boldsymbol{\gamma}$,
\[ \bar{\mathbb{P}}_\star\Bigl(\sup_{n \ge 0} \kappa_n < \infty\Bigr) = \bar{\mathbb{P}}_\star\Bigl(\bigcup_{n=0}^{\infty} \{T_n = \infty\}\Bigr) = 1, \]
that is, the number of reinitializations of the procedure described above is
finite $\bar{\mathbb{P}}_\star$-a.s. We postpone to Sections 5, 6 and 7 the
presentation and discussion of simple sufficient conditions that ensure that
this holds in concrete situations. We will, however, assume this property to
hold in Sections 3 and 4. Again, we stress the fact that our analysis of the
homogeneous Markov chain $\{Z_k\}$ ("the algorithm") for a given sequence
$\boldsymbol{\gamma}$ relies on the study of the inhomogeneous Markov chain
defined earlier (the "stopped process"), for the sequences
$\{\boldsymbol{\rho} := \boldsymbol{\gamma}^{\leftarrow \kappa}\}$ of
stepsizes. It is therefore important to precisely and probabilistically
relate these two processes. This is the aim of the lemma below (adapted from
[3], Lemma 4.1).
Define, for $\mathcal{K} \subset \Theta$,
\[ \sigma(\mathcal{K}) = \inf\{k \ge 1,\ \theta_k \notin \mathcal{K}\}. \tag{13} \]

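Lemma 1. For any $m \ge 1$, any nonnegative measurable function
$\Psi_m : (\mathsf{X} \times \Theta)^m \to \mathbb{R}^+$, any integer
$k \ge 0$ and any $(x,\theta) \in \mathsf{X} \times \Theta$, we have
\[ \bar{\mathbb{E}}_{x,\theta,k,0}\bigl[\Psi_m(X_1,\theta_1,\ldots,X_m,\theta_m)\,\mathbf{1}\{T_1 \ge m\}\bigr] = \mathbb{E}^{\boldsymbol{\gamma}^{\leftarrow k}}_{\Pi(x,\theta)}\bigl[\Psi_m(X_1,\theta_1,\ldots,X_m,\theta_m)\,\mathbf{1}\{\sigma(\mathcal{K}_k) \ge m\}\bigr]. \]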



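3. Law of large numbers. Hereafter, for a probability distribution
$\mathbb{P}$, the various kinds of convergence (in probability, almost-sure
and weak, i.e., in distribution) are denoted, respectively,
$\xrightarrow{\text{prob.}}_{\mathbb{P}}$,
$\xrightarrow{\text{a.s.}}_{\mathbb{P}}$ and
$\xrightarrow{\mathcal{D}}_{\mathbb{P}}$.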



3.1. Assumptions. As pointed out in the Introduction, an LLN has been
obtained for a particular adaptive MCMC algorithm by Haario, Saksman
and Tamminen [19], using mixingale theory (see [24]). Our approach is more
in line with the martingale proof of the LLN for Markov chains and is based
on the existence and regularity of the solutions of Poisson's equation and
martingale limit theory. The existence and appropriate properties of those
solutions can be easily established under a uniform (in the parameter
$\theta$) version of the $V$-uniform ergodicity of the transition kernels
$P_\theta$ [see condition (A1) below and Proposition 3].
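We will use the following notation throughout the paper. For
$W : \mathsf{X} \to [1,\infty)$ and $f : \mathsf{X} \to \mathbb{R}$ a
measurable function, define
\[ \|f\|_W = \sup_{x \in \mathsf{X}} \frac{|f(x)|}{W(x)} \quad\text{and}\quad \mathcal{L}_W = \{f : \|f\|_W < \infty\}. \tag{14} \]
We will also consider functions
$f : \Theta \times \mathsf{X} \to \mathbb{R}$. We will often use the
shorthand notation $f_\theta(x) = f(\theta,x)$ for all
$(\theta,x) \in \Theta \times \mathsf{X}$ in order to avoid ambiguities. We
will assume that $f_\theta \equiv 0$ whenever $\theta \notin \Theta$ except
when $f_\theta$ does not depend on $\theta$, that is,
$f_\theta \equiv f_{\theta'}$ for any
$(\theta,\theta') \in \Theta \times \Theta$. Let
$W : \mathsf{X} \to [1,\infty)$. We say that the family of functions
$\{f_\theta : \mathsf{X} \to \mathbb{R},\ \theta \in \Theta\}$ is
$W$-Lipschitz if, for any compact subset $\mathcal{K} \subset \Theta$,
\[ \sup_{\theta \in \mathcal{K}} \|f_\theta\|_W < \infty \quad\text{and}\quad \sup_{(\theta,\theta') \in \mathcal{K} \times \mathcal{K},\ \theta \ne \theta'} |\theta - \theta'|^{-1} \|f_\theta - f_{\theta'}\|_W < \infty. \tag{15} \]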

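(A1) For any $\theta \in \Theta$, $P_\theta$ has $\pi$ as its stationary
distribution. In addition, there exists a function
$V : \mathsf{X} \to [1,\infty)$ such that
$\sup_{x \in \mathsf{K}} V(x) < \infty$ (with $\mathsf{K}$ defined in
Section 2) and such that, for any compact subset
$\mathcal{K} \subset \Theta$:

(i) Minorization condition. There exist
$\mathsf{C} \in \mathcal{B}(\mathsf{X})$, $\varepsilon > 0$ and a
probability measure $\varphi$ (all three depending on $\mathcal{K}$) such
that $\varphi(\mathsf{C}) > 0$ and, for all
$A \in \mathcal{B}(\mathsf{X})$ and
$(\theta,x) \in \mathcal{K} \times \mathsf{C}$,
\[ P_\theta(x,A) \ge \varepsilon\,\varphi(A). \]

(ii) Drift condition. There exist constants $\lambda \in [0,1)$,
$b \in (0,\infty)$ (depending on $V$, $\mathsf{C}$ and $\mathcal{K}$)
satisfying
\[ P_\theta V(x) \le \begin{cases} \lambda V(x), & x \notin \mathsf{C}, \\ b, & x \in \mathsf{C}, \end{cases} \]
for all $\theta \in \mathcal{K}$.

(A2) For any compact subset $\mathcal{K} \subset \Theta$ and any
$r \in [0,1]$, there exists a constant $C$ (depending on $\mathcal{K}$ and
$r$) such that, for any $(\theta,\theta') \in \mathcal{K} \times \mathcal{K}$
and $f \in \mathcal{L}_{V^r}$,
\[ \|P_\theta f - P_{\theta'} f\|_{V^r} \le C \|f\|_{V^r} |\theta - \theta'|, \]
where $V$ is given in (A1).

(A3) $\{H_\theta, \theta \in \Theta\}$ is $V^\beta$-Lipschitz for some
$\beta \in [0,1/2]$ with $V$ defined in (A1).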



Remark 1. Note that for the sake of clarity and simplicity, we restrict
here our results to the case where $\{P_\theta, \theta \in \Theta\}$ satisfy
one-step drift and minorization conditions. As shown in [3], the more
general case where either an $m$-step drift or minorization condition is
assumed for $m > 1$ requires one to modify the algorithm in order to prevent
large jumps in the parameter space (see [2, 3]). This mainly leads to
substantial notational complications, but the arguments remain essentially
unchanged.

Conditions of type (A1) to establish geometric ergodicity have been
extensively studied over the last decade for the Metropolis-Hastings
algorithms. Typically, the required drift function depends on the target
distribution $\pi$, which makes our requirement of uniformity in
$\theta \in \mathcal{K}$ in (A1) reasonable and relatively easy to establish
(see Sections 6 and 7). The following theorem, due to [29] and recently
improved by [8], converts information about the drift and minorization
condition into information about the long-term behavior of the chain:

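Theorem 2. Assume (A1). Then, for all $\theta \in \Theta$, $P_\theta$
admits $\pi$ as its unique stationary probability measure, and
$\pi(V) < \infty$. Let $\mathcal{K} \subset \Theta$ be a compact subset and
$r \in [0,1]$. There exists $\tilde\rho < 1$ depending only (and explicitly)
on the constants $r$, $\varepsilon$, $\varphi(\mathsf{C})$, $\lambda$ and
$b$ [given in (A1)] such that, whenever $\rho \in (\tilde\rho, 1)$, there
exists $C < \infty$ depending only (and explicitly) on $r$, $\rho$,
$\varepsilon$, $\varphi(\mathsf{C})$ and $b$ such that for any
$f \in \mathcal{L}_{V^r}$, all $\theta \in \mathcal{K}$ and $k \ge 0$,
\[ \|P_\theta^k f - \pi(f)\|_{V^r} \le C \|f\|_{V^r} \rho^k. \tag{16} \]

Formulas for $\tilde\rho$ and $C$ are given in [29], Theorem 2.3, and have
been later improved in [8], Section 2.1.

This theorem automatically ensures the existence of solutions to Poisson's
equation. More precisely, for all $(\theta,x) \in \Theta \times \mathsf{X}$
and $f \in \mathcal{L}_{V^r}$,
$\sum_{k=0}^{\infty} |P_\theta^k f(x) - \pi(f)| < \infty$, and
$\hat f_\theta := \sum_{k=0}^{\infty} (P_\theta^k f - \pi(f))$ is a solution
of Poisson's equation
\[ \hat f_\theta - P_\theta \hat f_\theta = f - \pi(f). \tag{17} \]

Poisson's equation has proven to be a fundamental tool for the analysis of
additive functionals, in particular for establishing limit theorems such as
the (functional) central limit theorem (see, e.g., [9, 13, 18, 30], [28],
Chapter 17).

The Lipschitz continuity of the transition kernel $P_\theta$ as a function
of $\theta$ [assumption (A2)] does not seem to have been studied for the
Metropolis-Hastings algorithm. We establish this continuity for the SRWM
algorithm and the independent MH algorithm (IMH) in Sections 6 and 7. This
assumption, used in conjunction with (A1), allows one to establish the
Lipschitz continuity of the solution of Poisson's equation.

Proposition 3. Assume (A1). Suppose that the family of functions
$\{f_\theta, \theta \in \Theta\}$ is $V^r$-Lipschitz, for some
$r \in [0,1]$. Define, for any $\theta \in \Theta$,
$\hat f_\theta := \sum_{k=0}^{\infty} (P_\theta^k f_\theta - \pi(f_\theta))$.
Then, for any compact set $\mathcal{K}$, there exists a constant $C$ such
that, for any $(\theta,\theta') \in \mathcal{K} \times \mathcal{K}$,
\[ \|\hat f_\theta\|_{V^r} + \|P_\theta \hat f_\theta\|_{V^r} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^r}, \tag{18} \]
\[ \|\hat f_\theta - \hat f_{\theta'}\|_{V^r} + \|P_\theta \hat f_\theta - P_{\theta'} \hat f_{\theta'}\|_{V^r} \le C |\theta - \theta'| \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^r}. \tag{19} \]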
The proof is given in Appendix B.

Remark 2. The regularity of the solutions of Poisson's equation has
been studied under various ergodicity and regularity conditions on the
mapping $\theta \mapsto P_\theta$ (see, e.g., [9] and [7] for regularity
under conditions implying $V$-uniform geometric ergodicity). The results of
the proposition above are sharper than those reported in the literature,
because all the transition kernels $P_\theta$ share the same limiting
distribution $\pi$, a property which plays a key role in the proof.

We finish this section with a convergence result for the chain $\{X_k\}$
under the probability $\mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}$, which is
an important and direct byproduct of the property mentioned in the remark
immediately above. This result improves on [19, 21] and [6].

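Proposition 4. Assume (A1)-(A3), let $\rho \in (0,1)$ be as in (16), and let
$\boldsymbol{\rho} = \{\rho_k\}$ be a positive and finite, nonincreasing
sequence such that
$\limsup_{k \to \infty} \rho_{k-n(k)}/\rho_k < +\infty$ with, when
$k \ge k_0$ for some integer $k_0$,
$n(k) := \lfloor \log(\rho_k/\bar\rho)/\log(\rho) \rfloor + 1 \le k$, and
$n(k) = 0$ otherwise, where $\bar\rho \in [\rho_1, +\infty)$. Let
$f \in \mathcal{L}_{V^{1-\beta}}$, where $V$ is defined in (A1) and $\beta$
is defined in (A3). Let $\mathcal{K}$ be a compact subset of $\Theta$. Then,
there exists a constant $C \in (0,+\infty)$ [depending only on
$\mathcal{K}$, the constants in (A1) and $\boldsymbol{\rho}$] such that, for
any $(x,\theta) \in \mathsf{X} \times \mathcal{K}$,
\[ \bigl|\mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{(f(X_k) - \pi(f))\,\mathbf{1}\{\sigma(\mathcal{K}) \ge k\}\}\bigr| \le C \|f\|_{V^{1-\beta}}\,\rho_k\,V(x). \]

The proof appears in Appendix B.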

3.2. Law of large numbers. We prove in this section a law of large numbers
(LLN) under $\bar{\mathbb{P}}_\star$ for
$n^{-1} \sum_{k=1}^{n} f_{\theta_k}(X_k)$, where
$\{f_\theta, \theta \in \Theta\}$ is a set of sufficiently regular
functions. It is worth noting here that we need not require that the
sequence $\{\theta_k\}$ converges in order to establish our result. The
proof is based on the identity
\[ f_{\theta_k}(X_k) - \int_{\mathsf{X}} \pi(dx) f_{\theta_k}(x) = \hat f_{\theta_k}(X_k) - P_{\theta_k} \hat f_{\theta_k}(X_k), \]
where $\hat f_\theta$ is a solution of Poisson's equation (17). The
decomposition
\[ \begin{aligned} \hat f_{\theta_k}(X_k) - P_{\theta_k} \hat f_{\theta_k}(X_k) ={}& \bigl(\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})\bigr) \\ &+ \bigl(\hat f_{\theta_k}(X_k) - \hat f_{\theta_{k-1}}(X_k)\bigr) + \bigl(P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1}) - P_{\theta_k} \hat f_{\theta_k}(X_k)\bigr) \end{aligned} \tag{20} \]
displays the different terms that need to be controlled to prove the LLN.
The first term in the decomposition is (except at the time of a jump) a
martingale difference sequence. As we shall see, this is the leading term in
the decomposition, and the other terms are remainders which are easily dealt
with, thanks to the regularity of the solutions of Poisson's equation under
(A1). The term $\hat f_{\theta_k}(X_k) - \hat f_{\theta_{k-1}}(X_k)$ can be
interpreted as the perturbation introduced by adaptation. We preface our
main result, Theorem 8, with two intermediate propositions concerned with
the control of the fluctuations of the sum
$\sum_{k=1}^{n} (f_{\theta_k}(X_k) - \int_{\mathsf{X}} \pi(dx) f_{\theta_k}(x))$
for the inhomogeneous chain $\{(X_k,\theta_k)\}$ under the probability
$\mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}$. The following lemma, whose
proof is given in Appendix A, is required in order to prove Propositions 6
and 7:


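Lemma 5. Assume (A1). Let $\mathcal{K} \subset \Theta$ be a compact set and
$r \in [0,1]$ be a constant. There exists a constant $C$ [depending only on
$r$, $\mathcal{K}$ and the constants in (A1)] such that, for any sequences
$\boldsymbol{\rho} = \{\rho_k\}$ and $a = \{a_k\}$ of positive numbers and
for any $(x,\theta) \in \mathsf{X} \times \mathcal{K}$,
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\bigl[V^r(X_k)\,\mathbf{1}\{\sigma(\mathcal{K}) \ge k\}\bigr] \le C V^r(x), \tag{21} \]
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\Bigl[\max_{1 \le m \le k} (a_m V(X_m))^r\,\mathbf{1}\{\sigma(\mathcal{K}) \ge m\}\Bigr] \le C \Bigl(\sum_{m=1}^{k} a_m\Bigr)^r V^r(x), \tag{22} \]
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\Bigl[\max_{1 \le m \le n} \mathbf{1}\{\sigma(\mathcal{K}) \ge m\} \sum_{k=1}^{m} V^r(X_k)\Bigr] \le C n V^r(x). \tag{23} \]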



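Proposition 6. Assume (A1)-(A3). Let $\{f_\theta, \theta \in \Theta\}$ be a
$V^\alpha$-Lipschitz family of functions for some
$\alpha \in [0, 1-\beta)$, where $V$ is defined in (A1) and $\beta$ is
defined in (A3). Let $\mathcal{K}$ be a compact subset of $\Theta$. Then,
for any $p \in (1, 1/(\alpha+\beta)]$, there exists a constant $C$
[depending only on $p$, $\mathcal{K}$ and the constants in (A1)] such that,
for any sequence $\boldsymbol{\rho} = \{\rho_k\}$ of positive numbers
satisfying $\sum_{k=1}^{\infty} k^{-1}\rho_k < \infty$, we have, for all
$(x,\theta) \in \mathsf{X} \times \mathcal{K}$, $\delta > 0$,
$\gamma > \alpha$ and integer $l \ge 1$,
\[ \mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}\Bigl[\sup_{m \ge l} m^{-1}|M_m| \ge \delta\Bigr] \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\, l^{-\{(p/2) \wedge (p-1)\}}\, V^{\alpha p}(x), \tag{24} \]
\[ \mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}\Bigl[\sup_{m \ge l} \mathbf{1}\{\sigma(\mathcal{K}) > m\}\, m^{-\gamma} \Bigl|\sum_{k=1}^{m} \Bigl(f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\pi(dx)\Bigr) - M_m\Bigr| \ge \delta\Bigr] \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \Bigl\{\,\cdots\Bigr\} \times V^{\alpha p}(x), \tag{25} \]
where $\sigma(\mathcal{K})$ is given in (13),
$\hat f_\theta := \sum_{k=0}^{\infty} [P_\theta^k f_\theta - \pi(f_\theta)]$
is a solution of Poisson's equation (17) and
\[ M_m := \mathbf{1}\{\sigma(\mathcal{K}) > m\} \sum_{k=1}^{m} \bigl[\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})\bigr]. \tag{26} \]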



Remark 3. The result provides us with some useful insights into the
properties of MCMC with vanishing adaptation. First, whenever
$\{\theta_k\} \subset \mathcal{K} \subset \Theta$ for a deterministic
compact set $\mathcal{K}$, the bounds above give explicit rates of
convergence for ergodic averages (1). The price we must pay for adaptation
is apparent on the right-hand side of (25), as reported in [1]. The
constraints on $p$, $\beta$ and $\alpha$ illustrate the tradeoff between the
rate of convergence, the smoothness of adaptation and the class of functions
covered by our result. Scenarios of interest include the case where
assumption (A3) is satisfied with $\beta = 0$ [in other words, for any
compact subset $\mathcal{K} \subset \Theta$,
$\sup_{\theta \in \mathcal{K}} \sup_{x \in \mathsf{X}} |H_\theta(x)| < \infty$]
or the case where (A3) holds for any $\beta \in (0,1/2]$, both of which
imply that the results of the proposition hold for any $\alpha < 1$ (see
Theorem 15 for an application).

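Proof of Proposition 6. For notational simplicity, we set
$\sigma := \sigma(\mathcal{K})$. Let $p \in (1, 1/(\alpha+\beta)]$ and
$\mathcal{K} \subset \Theta$ be a compact set. In this proof, $C$ is a
constant which only depends on the constants in (A1)-(A3), $p$ and
$\mathcal{K}$; this constant may take different values upon each appearance.
Theorem 2 implies that for any $\theta \in \Theta$, $\hat f_\theta$ exists
and is a solution of Poisson's equation (17). We decompose the sum
$\mathbf{1}\{\sigma > m\} \sum_{k=1}^{m} (f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\pi(dx))$
as $M_m + R_m^{(1)} + R_m^{(2)}$, where
\[ R_m^{(1)} := \mathbf{1}\{\sigma > m\} \sum_{k=1}^{m} \bigl(\hat f_{\theta_k}(X_k) - \hat f_{\theta_{k-1}}(X_k)\bigr), \]
\[ R_m^{(2)} := \mathbf{1}\{\sigma > m\} \bigl(P_{\theta_0} \hat f_{\theta_0}(X_0) - P_{\theta_m} \hat f_{\theta_m}(X_m)\bigr). \]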

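We consider these terms separately. First, since
$\mathbf{1}\{\sigma > m\} = \mathbf{1}\{\sigma > m\}\mathbf{1}\{\sigma \ge k\}$
for $0 \le k \le m$,
\[ |M_m| = \mathbf{1}\{\sigma > m\} \Bigl|\sum_{k=1}^{m} \bigl(\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})\bigr)\mathbf{1}\{\sigma \ge k\}\Bigr| \le |\bar M_m|, \]
where
\[ \bar M_m := \sum_{k=1}^{m} \bigl[\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})\bigr]\mathbf{1}\{\sigma \ge k\}. \]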
k=l 
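By Proposition 3, equation (18), there exists a constant $C$ such that, for
all $\theta \in \mathcal{K}$,
$\|\hat f_\theta\|_{V^\alpha} + \|P_\theta \hat f_\theta\|_{V^\alpha} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha}$.
Since $0 \le \alpha p \le 1$, by using (21) in Lemma 5, we have, for all
$(x,\theta) \in \mathsf{X} \times \mathcal{K}$,
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\bigl\{\bigl(|\hat f_{\theta_{k-1}}(X_k)|^p + |P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})|^p\bigr)\mathbf{1}\{\sigma \ge k\}\bigr\} \le C V^{\alpha p}(x) \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}. \tag{27} \]
Since
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\bigl\{[\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})]\mathbf{1}\{\sigma \ge k\} \,\big|\, \mathcal{F}_{k-1}\bigr\} = \bigl(P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1}) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})\bigr)\mathbf{1}\{\sigma \ge k\} = 0, \]
$\{\bar M_m\}$ is a
$(\mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}, \{\mathcal{F}_k\})$-adapted
martingale with increments bounded in $L^p$. Using Burkholder's inequality
for $p > 1$ ([20], Theorem 2.10), we have
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\bar M_m|^p\} \le C_p\, \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\Bigl\{\Bigl(\sum_{k=1}^{m} |\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})|^2\,\mathbf{1}\{\sigma \ge k\}\Bigr)^{p/2}\Bigr\}, \tag{28} \]
where $C_p$ is a universal constant. For $p \ge 2$, by Minkowski's
inequality,
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\bar M_m|^p\} \le C_p \Bigl\{\sum_{k=1}^{m} \bigl(\mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})|^p\,\mathbf{1}\{\sigma \ge k\}\}\bigr)^{2/p}\Bigr\}^{p/2}. \tag{29} \]
For $1 \le p \le 2$, we have
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\bar M_m|^p\} \le C_p\, \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\Bigl\{\sum_{k=1}^{m} |\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat f_{\theta_{k-1}}(X_{k-1})|^p\,\mathbf{1}\{\sigma \ge k\}\Bigr\}. \tag{30} \]
By combining the two cases above and using (27), we obtain for any
$(x,\theta) \in \mathsf{X} \times \mathcal{K}$ and $p > 1$,
\[ \mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\bar M_m|^p\} \le C m^{(p/2) \vee 1} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\, V^{\alpha p}(x). \tag{31} \]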

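Let $l \ge 1$. By Birnbaum and Marshall's [10] inequality (a straightforward
adaptation of the Birnbaum and Marshall [10] result is given in
Proposition 22, Appendix A) and (31), there exists a constant $C$ such that
\[ \begin{aligned} \mathbb{P}^{\boldsymbol{\rho}}_{x,\theta}\Bigl[\sup_{m \ge l} m^{-1}|\bar M_m| \ge \delta\Bigr] &\le \delta^{-p} \Bigl\{\sum_{m=l}^{\infty} \bigl(m^{-p} - (m+1)^{-p}\bigr)\,\mathbb{E}^{\boldsymbol{\rho}}_{x,\theta}\{|\bar M_m|^p\}\Bigr\} \\ &\le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \Bigl\{\sum_{m=l}^{\infty} m^{-p-1+(p/2) \vee 1}\Bigr\} V^{\alpha p}(x) \\ &\le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\, l^{-\{(p/2) \wedge (p-1)\}}\, V^{\alpha p}(x), \end{aligned} \]
which proves (24). We now consider $R_m^{(1)}$. Equation (19) shows that
there exists a constant $C$ such that, for any
$(\theta,\theta') \in \mathcal{K} \times \mathcal{K}$,
$\|\hat f_\theta - \hat f_{\theta'}\|_{V^\alpha} \le C |\theta - \theta'| \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha}$.
On the other hand, by construction,
$\theta_k - \theta_{k-1} = \rho_k H(\theta_{k-1}, X_k)$ for
$\sigma \ge k$ and, under assumption (A3), there exists a constant $C$ such
that, for any $(x,\theta) \in \mathsf{X} \times \mathcal{K}$,
$|H(\theta,x)| \le C V^\beta(x)$. Therefore, there exists $C$ such that, for
any $m \ge l$ and $\gamma > \alpha$,
\[ m^{-\gamma} |R_m^{(1)}| \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha} \sum_{k=1}^{m} k^{-\gamma} \rho_k V^{\alpha+\beta}(X_k)\,\mathbf{1}\{\sigma \ge k\}. \]
Hence, using Minkowski's inequality and (21), one deduces that there exists
$C$ such that, for $(x,\theta) \in \mathsf{X} \times \mathcal{K}$,
$l \ge 1$ and $\gamma > \alpha$,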


(32) 



E^<jsupm 



-TP 



l-^m I 



. A = l 



Consider now $R_m^{(2)}$. The term $P_{\theta_0}\hat f_{\theta_0}(X_0)$ does not pose any problem. From
Lemma 5 [equation (22)] there exists a constant $C$ such that, for all $(x,\theta) \in \mathsf{X} \times \mathcal{K}$ and $0 < \alpha < \gamma$,



E£j supm-^|P, m /, m (X m )| p l{ f T > m} 



(33) 



6»G/C 



< Csup ||/ e ||^E^J sup |m-^ a F a (X m )| p l{c7>m} 



ap 



eeic 



. k=i
The case $\alpha = 0$ is straightforward. From Markov's inequality, (32) and (33),
one deduces (25). □

We can now apply the results of Proposition 6 for the inhomogeneous
Markov chain defined below (10) to the time-homogeneous chain $\{Z_k\}$
under the assumption that the number of reinitializations $\kappa_n$ is $\mathbb{P}_\star$-almost
surely finite. Note that the very general form of the result will allow us to
prove a central limit theorem in Section 4.



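Proposition 7. Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$ and let
$\gamma = \{\gamma_k\}$ be a nonincreasing positive sequence such that $\sum_{k=1}^{\infty} k^{-1}\gamma_k < \infty$.
Consider the time-homogeneous Markov chain $\{Z_k\}$ on $\mathsf{Z}$ with transition
probability $R$, as defined in Section 2. Assume (A1)-(A3), and let $\mathcal{F} = \{f_\theta, \theta \in \Theta\}$ be a $V^\alpha$-Lipschitz family of functions for some $\alpha \in [0, 1-\beta)$, with
$\beta$ as in (A3) and $V$ as in (A1). Assume, in addition, that $\mathbb{P}_\star\{\lim_{n\to\infty} \kappa_n < \infty\} = 1$. Then, for any $f_\theta \in \mathcal{F}$,

\[
n^{-1} \sum_{k=1}^{n} \Bigl(f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\pi(dx)\Bigr)\mathbb{1}\{\nu_k \ne 0\} \to 0, \qquad \mathbb{P}_\star\text{-a.s.} \tag{34}
\]
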
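Proof. Without loss of generality, we may assume that, for any $\theta \in \Theta$,
$\int_{\mathsf{X}} f_\theta(x)\pi(dx) = 0$. Let $p \in (1, 1/(\alpha+\beta)]$ and define $\kappa_\infty = \lim_{n\to\infty} \kappa_n$. By
construction of the $T_k$'s [see (12)], and since $\kappa_\infty < \infty$ $\mathbb{P}_\star$-a.s., we deduce
that $T_{\kappa_\infty} < \infty$, $\mathbb{P}_\star$-a.s. We now decompose $S_n = \sum_{k=1}^{n} f_{\theta_k}(X_k)\mathbb{1}\{\nu_k \ne 0\}$
as $S_n = S_n^{(1)} + S_n^{(2)}$, where $S_n^{(1)} = \sum_{k=1}^{T_{\kappa_\infty}\wedge n} f_{\theta_k}(X_k)\mathbb{1}\{\nu_k \ne 0\}$ and
$S_n^{(2)} = \sum_{k=T_{\kappa_\infty}+1}^{n} f_{\theta_k}(X_k)\mathbb{1}\{\nu_k \ne 0\}$. Since $T_{\kappa_\infty} < \infty$ $\mathbb{P}_\star$-a.s., $\sum_{k=1}^{T_{\kappa_\infty}} |f_{\theta_k}(X_k)|\mathbb{1}\{\nu_k \ne 0\} < \infty$,
$\mathbb{P}_\star$-a.s., showing that $n^{-1}S_n^{(1)} \to 0$, $\mathbb{P}_\star$-a.s. We now bound the second term.
For any integers $n$ and $K$,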



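\[
\mathbb{P}_\star\Bigl[\sup_{m \ge n} m^{-1}|S_m^{(2)}| \ge \delta, \kappa_\infty \le K\Bigr]
\le \sum_{i=0}^{K} \mathbb{P}_\star\Bigl[\sup_{m \ge n} m^{-1}\Bigl|\sum_{k=T_i+1}^{m} f_{\theta_k}(X_k)\Bigr| \ge \delta, \kappa_\infty = i, T_i \le n/2\Bigr]
+ \mathbb{P}_\star[T_{\kappa_\infty} > n/2]. \tag{35}
\]

Since, for $\kappa_\infty = i$, $T_{i+1} = \infty$,

\[
\begin{aligned}
\Bigl\{\sup_{m \ge n} m^{-1}\Bigl|\sum_{k=T_i+1}^{m} f_{\theta_k}(X_k)\Bigr| \ge \delta, \kappa_\infty = i, T_i \le n/2\Bigr\}
&\subset \Bigl\{\sup_{m \ge n} \mathbb{1}\{T_{i+1} > m\}\, m^{-1}\Bigl|\sum_{k=T_i+1}^{m} f_{\theta_k}(X_k)\Bigr| \ge \delta, T_i \le n/2\Bigr\} \\
&\subset \Bigl\{\Bigl(\sup_{m \ge n/2} \mathbb{1}\{\sigma(\mathcal{K}_i) > m\}\, m^{-1}\Bigl|\sum_{k=1}^{m} f_{\theta_k}(X_k)\Bigr|\Bigr) \circ \tau^{T_i} \ge \delta\Bigr\},
\end{aligned}
\]
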

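where we have used the fact that $T_{i+1} = T_i + \sigma(\mathcal{K}_i) \circ \tau^{T_i}$, where $\tau$ is the shift
operator on the canonical space of the chain $\{Z_n\}$. As a consequence, by applying
Lemma 1 [noting that $\mathbb{1}\{\sigma(\mathcal{K}_i) > m\}\mathbb{1}\{\sigma(\mathcal{K}_i) \ge m\} = \mathbb{1}\{\sigma(\mathcal{K}_i) > m\}$
and $\mathbb{1}\{\sigma(\mathcal{K}_i) \ge m\} \in \mathcal{F}_{m-1}$], the strong Markov property, Proposition 6 with
$\gamma = 1$, the fact that $\{\gamma_k\}$ is nonincreasing and the fact that $\sup_{x\in\mathsf{K}} V(x) < \infty$,
we have, for $0 \le i \le K$,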



sup m 

m>n 



E fam 

k=T t + l 



>5,k 00 = i,Ti < n/2 



Gt, 



<P 7 



(36) 



sup l{a(JCi) > m}m 1 

m>n/2 



fc=i 



><5 



<C5~p sup||/ e ||p J EtKsjvA;]-^ 

^ l\fc=i > 



+ 



-p(l-a) 



+ 



-{(p/2)A(p-l)}' 



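By Kronecker's lemma, the condition $\sum_{k=1}^{\infty} k^{-1}\gamma_k < \infty$ implies that
$n^{-1}\sum_{k=1}^{n} \gamma_k \to 0$ as $n \to \infty$, showing that

\[
\sum_{k=1}^{\infty} [\lfloor n/2 \rfloor \vee k]^{-1}\gamma_k \le (\lfloor n/2 \rfloor)^{-1} \sum_{k=1}^{\lfloor n/2 \rfloor} \gamma_k + \sum_{k=\lfloor n/2 \rfloor+1}^{\infty} k^{-1}\gamma_k \to 0,
\]

as $n \to \infty$. Combining this with (35) and (36) shows that, for any $K, \delta, \eta > 0$,
there exists $N$ such that, for $n \ge N$,

\[
\mathbb{P}_\star\Bigl[\sup_{m \ge n} m^{-1}|S_m^{(2)}| \ge \delta, \kappa_\infty \le K\Bigr] \le \eta.
\]

Now, for $K$ large enough that $\mathbb{P}_\star[\kappa_\infty > K] \le \eta$, the result above shows that
there exists an $N$ such that, for any $n \ge N$, $\mathbb{P}_\star[\sup_{m \ge n} m^{-1}|S_m^{(2)}| \ge \delta] \le 2\eta$,
concluding the proof. □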
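
The Kronecker-lemma step in the proof of Proposition 7 can be checked numerically. The sketch below is only an illustration: the stepsize choice $\gamma_k = k^{-0.6}$, the truncation level `k_max` and the helper names are assumptions made for this example and do not come from the paper. It evaluates a truncated version of $\sum_{k\ge 1} [\lfloor n/2\rfloor \vee k]^{-1}\gamma_k$ and the Cesaro mean $n^{-1}\sum_{k\le n}\gamma_k$ for increasing $n$.

```python
def tail_weighted_sum(gamma, n, k_max):
    """Truncated sum over k = 1..k_max of (floor(n/2) v k)^{-1} * gamma(k)."""
    half = max(n // 2, 1)
    return sum(gamma(k) / max(half, k) for k in range(1, k_max + 1))


def cesaro_mean(gamma, n):
    """n^{-1} * sum_{k=1}^{n} gamma(k), the quantity controlled by Kronecker's lemma."""
    return sum(gamma(k) for k in range(1, n + 1)) / n


# Hypothetical stepsize sequence: nonincreasing, positive, sum_k gamma_k / k finite.
gamma = lambda k: k ** -0.6

grid = (10, 100, 1000, 10000)
values = [tail_weighted_sum(gamma, n, 200_000) for n in grid]
means = [cesaro_mean(gamma, n) for n in grid]

for n, v, m in zip(grid, values, means):
    print(f"n={n:6d}  weighted sum={v:.4f}  Cesaro mean={m:.4f}")
```

Both columns decrease with $n$, in line with the limit used to conclude the proof; the truncation only makes the reported sums slightly smaller than the infinite series.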

Remark 4. How one checks $\mathbb{P}_\star(\lim_{n\to\infty} \kappa_n < \infty) = 1$ depends on the
particular algorithm used to update the parameters. Verifiable conditions
have been established in [3] for checking the stability of the algorithm; see
Sections 5, 6 and 7.



We may now state our main consistency result.

Theorem 8. Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$ and let $\gamma = \{\gamma_k\}$ be a nonincreasing positive sequence such that $\sum_{k=1}^{\infty} k^{-1}\gamma_k < \infty$. Consider
the time-homogeneous Markov chain $\{Z_k\}$ on $\mathsf{Z}$ with transition probability
$R$, as defined in Section 2. Assume (A1)-(A3) and let $f : \mathsf{X} \to \mathbb{R}$ be a
function such that $\|f\|_{V^\alpha} < \infty$ for some $\alpha \in [0, 1-\beta)$, with $\beta$ as in (A3) and
$V$ as in (A1). Assume, in addition, that $\mathbb{P}_\star\{\lim_{n\to\infty} \kappa_n < \infty\} = 1$. Then,

\[
n^{-1} \sum_{k=1}^{n} [f(X_k) - \pi(f)] \to 0, \qquad \mathbb{P}_\star\text{-a.s.} \tag{37}
\]

Proof. We may assume that $\pi(f) = 0$. From Proposition 7, it is sufficient
to prove that $n^{-1}\sum_{j=1}^{\kappa_n} |f(X_{T_j})| \to 0$, $\mathbb{P}_\star$-a.s. Since $\kappa_\infty < \infty$ $\mathbb{P}_\star$-a.s.,
$\sum_{j=1}^{\kappa_\infty} |f(X_{T_j})| < \infty$ $\mathbb{P}_\star$-a.s. The proof follows from $\sum_{j=1}^{\kappa_n} |f(X_{T_j})| \le \sum_{j=1}^{\kappa_\infty} |f(X_{T_j})|$. □

4. Invariance principle. We shall now prove an invariance principle. As
in the case of homogeneous Markov chains, more stringent conditions are
required here than for the simple LLN. In particular, we will require here that
the sequence $\{\theta_k\}$ converges $\mathbb{P}_\star$-a.s. This is in contrast with simple consistency,
for which boundedness of $\{\theta_k\}$ was sufficient. The main idea of the proof
consists of approximating $n^{-1/2}\sum_{k=1}^{n} \{f(X_k) - \pi(f)\}$ with a triangular array
of martingale difference sequences, and then applying an invariance principle
for martingale differences to show the desired result.

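Theorem 9. Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$ and let $\gamma = \{\gamma_k\}$ be a nonincreasing positive sequence such that $\sum_{k=1}^{\infty} k^{-1/2}\gamma_k < \infty$. Consider
the time-homogeneous Markov chain $\{Z_k\}$ on $\mathsf{Z}$ with transition probability
$R$, as defined in Section 2. Assume (A1)-(A3) and let $f : \mathsf{X} \to \mathbb{R}$ satisfy
$f \in \mathcal{L}_{V^\alpha}$, where $V$ is defined in (A1) and $\alpha \in [0, (1-\beta)/2)$ with $\beta$ as in (A3).
Define, for any $\theta \in \Theta$,

\[
\sigma^2(\theta, f) \stackrel{\mathrm{def}}{=} \pi[\hat f_\theta^2 - (P_\theta \hat f_\theta)^2] \quad\text{with}\quad \hat f_\theta \stackrel{\mathrm{def}}{=} \sum_{k=0}^{\infty} [P_\theta^k f - \pi(f)]. \tag{38}
\]

Assume, in addition, that there exists a random variable $\theta_\infty \in \Theta$ such that
$\int_{\mathsf{X}} \pi(dx)\hat f^2_{\theta_\infty}(x) < \infty$ and $\int_{\mathsf{X}} \pi(dx)(P_{\theta_\infty}\hat f_{\theta_\infty}(x))^2 < \infty$ $\mathbb{P}_\star$-a.s. and

\[
\limsup_{n\to\infty} |\theta_n - \theta_\infty| = 0, \qquad \mathbb{P}_\star\text{-a.s.}
\]

Then,

\[
n^{-1/2} \sum_{k=1}^{n} (f(X_k) - \pi(f)) \to_{\mathcal{D}} Z,
\]

where the random variable $Z$ has characteristic function $\mathbb{E}_\star[\exp(-\tfrac12\sigma^2(\theta_\infty, f)t^2)]$.
If in addition $\sigma(\theta_\infty, f) > 0$, $\mathbb{P}_\star$-a.s., then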

(39) * n E(/(X fc ) - n(f)) ^ Af(0, 1). 

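Proof. Without loss of generality, we suppose that $\pi(f) = 0$. The proof
again relies on a martingale approximation. Set, for $k \ge 1$,

\[
\xi_k \stackrel{\mathrm{def}}{=} [\hat f_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}}\hat f_{\theta_{k-1}}(X_{k-1})]\mathbb{1}\{\nu_{k-1} \ne 0\}. \tag{40}
\]

Since $f$ is $V^\alpha$-Lipschitz, Proposition 3 shows that $\{\hat f_\theta, \theta \in \Theta\}$ and $\{P_\theta\hat f_\theta, \theta \in \Theta\}$
are $V^\alpha$-Lipschitz. Since $2\alpha < 1$, this implies that $\{P_\theta\hat f_\theta^2, \theta \in \Theta\}$ and
$\{(P_\theta\hat f_\theta)^2, \theta \in \Theta\}$ are $V^{2\alpha}$-Lipschitz. We deduce that $\{\xi_k\}$ is a $(\mathbb{P}_\star, \{\mathcal{G}_k, k \ge 0\})$-adapted
square-integrable martingale difference sequence, that is, for all
$k \ge 1$, $\mathbb{E}_\star[\xi_k^2] < \infty$ and $\mathbb{E}_\star[\xi_k \mid \mathcal{G}_{k-1}] = 0$, $\mathbb{P}_\star$-a.s. We are going to prove that,
with $Z$ a r.v. with characteristic function $\mathbb{E}_\star[\exp(-\tfrac12\sigma^2(\theta_\infty, f)t^2)]$,

\[
\frac{1}{\sqrt{n}} \sum_{k=1}^{n} \xi_k \to_{\mathcal{D}} Z, \tag{41}
\]

\[
\frac{1}{\sqrt{n}} \sum_{k=1}^{n} f(X_k) - \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \xi_k \to_{\mathbb{P}_\star} 0. \tag{42}
\]
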

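To show (41), we use [20], Corollary 3.1 of Theorem 3.2. We need to establish
that:

(a) the sequence $n^{-1}\sum_{k=1}^{n} \mathbb{E}_\star[\xi_k^2 \mid \mathcal{G}_{k-1}]$ converges in $\mathbb{P}_\star$-probability to
$\sigma^2(\theta_\infty, f)$;

(b) the conditional Lindeberg condition is satisfied, that is,

\[
\text{for all } \epsilon > 0, \qquad n^{-1} \sum_{k=1}^{n} \mathbb{E}_\star[\xi_k^2 \mathbb{1}\{|\xi_k| \ge \epsilon\sqrt{n}\} \mid \mathcal{G}_{k-1}] \to_{\mathbb{P}_\star} 0. \tag{43}
\]
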

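We first prove (a). Note that

\[
\mathbb{E}_\star[\xi_k^2 \mid \mathcal{G}_{k-1}] = \bigl[P_{\theta_{k-1}}\hat f^2_{\theta_{k-1}}(X_{k-1}) - (P_{\theta_{k-1}}\hat f_{\theta_{k-1}}(X_{k-1}))^2\bigr]\mathbb{1}\{\nu_{k-1} \ne 0\}.
\]

Since $\{P_\theta\hat f_\theta^2, \theta \in \Theta\}$ and $\{(P_\theta\hat f_\theta)^2, \theta \in \Theta\}$ are $V^{2\alpha}$-Lipschitz and $2\alpha \in [0, 1-\beta)$,
we may apply Proposition 7 to prove that

\[
\frac1n \sum_{k=1}^{n} \mathbb{E}_\star[\xi_k^2 \mid \mathcal{G}_{k-1}]
- \frac1n \sum_{k=0}^{n-1} \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_k}\hat f^2_{\theta_k}(x) - (P_{\theta_k}\hat f_{\theta_k}(x))^2\bigr]\mathbb{1}\{\nu_k \ne 0\} \to 0, \qquad \mathbb{P}_\star\text{-a.s.}
\]

For any $j \ge 0$ and $\kappa_\infty = j$, $\{\theta_k, k \ge T_j\} \subset \mathcal{K}_j$, which, together with the
$V^{2\alpha}$-Lipschitz property and the dominated convergence theorem, implies
that

\[
\lim_{k\to\infty} \mathbb{1}\{\kappa_\infty = j\} \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_k}\hat f^2_{\theta_k}(x) - (P_{\theta_k}\hat f_{\theta_k}(x))^2\bigr]
= \mathbb{1}\{\kappa_\infty = j\} \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_\infty}\hat f^2_{\theta_\infty}(x) - (P_{\theta_\infty}\hat f_{\theta_\infty}(x))^2\bigr], \qquad \mathbb{P}_\star\text{-a.s.}
\]

By the dominated convergence theorem and the fact that $\mathbb{P}_\star(\kappa_\infty < \infty) = 1$,

\[
\lim_{k\to\infty} \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_k}\hat f^2_{\theta_k}(x) - (P_{\theta_k}\hat f_{\theta_k}(x))^2\bigr]\mathbb{1}\{\nu_k \ne 0\}
= \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_\infty}\hat f^2_{\theta_\infty}(x) - (P_{\theta_\infty}\hat f_{\theta_\infty}(x))^2\bigr], \qquad \mathbb{P}_\star\text{-a.s.},
\]

and the Cesaro convergence theorem finally shows that

\[
\lim_{n\to\infty} \frac1n \sum_{k=0}^{n-1} \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_k}\hat f^2_{\theta_k}(x) - (P_{\theta_k}\hat f_{\theta_k}(x))^2\bigr]\mathbb{1}\{\nu_k \ne 0\}
= \int_{\mathsf{X}} \pi(dx)\bigl[P_{\theta_\infty}\hat f^2_{\theta_\infty}(x) - (P_{\theta_\infty}\hat f_{\theta_\infty}(x))^2\bigr], \qquad \mathbb{P}_\star\text{-a.s.}
\]
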

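We now establish the conditional Lindeberg condition in (b). We use the
following lemma, which is a conditional version of [14], Lemma 3.3.

Lemma 10. Let $\mathcal{G}$ be a $\sigma$-field and $X$ a random variable such that
$\mathbb{E}[X^2 \mid \mathcal{G}] < \infty$. Then, for any $\epsilon > 0$,

\[
4\,\mathbb{E}[|X|^2 \mathbb{1}\{|X| \ge \epsilon\} \mid \mathcal{G}] \ge \mathbb{E}[|X - \mathbb{E}[X \mid \mathcal{G}]|^2 \mathbb{1}\{|X - \mathbb{E}[X \mid \mathcal{G}]| \ge 2\epsilon\} \mid \mathcal{G}].
\]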

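Using Dvoretzky's lemma (Lemma 10), we have, for any $\epsilon, M > 0$ and $n$ sufficiently
large,

\[
n^{-1} \sum_{k=1}^{n} \mathbb{E}_\star[\xi_k^2 \mathbb{1}\{|\xi_k| \ge \epsilon\sqrt{n}\} \mid \mathcal{G}_{k-1}]
\le 4 n^{-1} \sum_{k=1}^{n} \int_{\mathsf{X}} P_{\theta_{k-1}}(X_{k-1}, dx)\hat f^2_{\theta_{k-1}}(x)\mathbb{1}\{|\hat f_{\theta_{k-1}}(x)| \ge M\}\mathbb{1}\{\nu_{k-1} \ne 0\}, \qquad \mathbb{P}_\star\text{-a.s.}
\]

Proceeding as above, the right-hand side of the previous display converges
$\mathbb{P}_\star$-a.s. to

\[
4 \int_{\mathsf{X}} \pi(dx)\hat f^2_{\theta_\infty}(x)\mathbb{1}\{|\hat f_{\theta_\infty}(x)| \ge M\},
\]

where we have used the fact that, for any $\theta \in \Theta$, $\pi P_\theta = \pi$. Since $\int_{\mathsf{X}} \pi(dx)\hat f^2_{\theta_\infty}(x) < \infty$ $\mathbb{P}_\star$-a.s., the monotone convergence theorem implies that

\[
\lim_{M\to\infty} \int_{\mathsf{X}} \pi(dx)\hat f^2_{\theta_\infty}(x)\mathbb{1}\{|\hat f_{\theta_\infty}(x)| \ge M\} = 0, \qquad \mathbb{P}_\star\text{-a.s.},
\]

showing that the conditional Lindeberg condition (b) holds.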

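In order to prove equation (42), we proceed along the lines of the proof of
Proposition 7. First, since $T_{\kappa_\infty} < \infty$, $\mathbb{P}_\star$-a.s., $\sum_{k=1}^{T_{\kappa_\infty}} |f(X_k)| + \sum_{k=1}^{T_{\kappa_\infty}} |\xi_k| < \infty$,
$\mathbb{P}_\star$-a.s., which implies that

\[
n^{-1/2}\Bigl(\sum_{k=1}^{T_{\kappa_\infty}} f(X_k) - \sum_{k=1}^{T_{\kappa_\infty}} \xi_k\Bigr) \to 0, \qquad \mathbb{P}_\star\text{-a.s.} \tag{44}
\]
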

Second, proceeding as in the proof of (36) and using (25) with $\gamma = 1/2$
and some $p \in (1, 1/(\alpha+\beta)]$, since $f$ is $V^\alpha$-Lipschitz, we have that, for any
$0 \le i \le K$ for some $K > 0$ and $n > 0$,



(45) 



sup m 

m>n 



-1/2 



<ca-*||/||£„ 



E /(**)- E 6 

fe=Tj+l fc=Ti+l 
f / 00 

EtL^Jvfc]- 1 / 2 



> 5, = i,Tj < n/2 



7*1 + 



>fc=i 



nN p(l/2-a)" 
2 



Under the assumption $\sum_{k=1}^{\infty} k^{-1/2}\gamma_k < \infty$, proceeding as below equation
(36), one can show that $\lim_{n\to\infty} \sum_{k=1}^{\infty} [\lfloor n/2 \rfloor \vee k]^{-1/2}\gamma_k = 0$. Arguing as in
(35), we conclude that



(46) 



■/?? 



-1/2 



E /(**)- E 6 



k=Ti+l 



k=T, + l 



>5 0. 



The proof of (42) follows from (44) and (46). The proof of (39) follows from 
[20], Corollary 3.2 of Theorem 3.3. □ 



5. Stability and convergence of the stochastic approximation process. In
order to conclude the part of this paper dedicated to the general theory of
adaptive MCMC algorithms, we now present generally verifiable conditions
under which the number of reinitializations of the algorithm that produces
the Markov chain $\{Z_k\}$ described in Section 2 is $\mathbb{P}_\star$-a.s. finite. This is a
difficult problem per se, which has been worked out in a companion paper,
[3]. We here briefly introduce the conditions under which this key property
is satisfied and give (without proof) the main stability result. The reader
should refer to [3] for more details.

As mentioned in the Introduction, the convergence of the stochastic approximation
procedure is closely related to the stability of the noiseless sequence
$\bar\theta_{k+1} = \bar\theta_k + \gamma_{k+1}h(\bar\theta_k)$. A practical technique for proving the stability
of the noiseless sequence consists, when possible, of determining a Lyapunov
function $w : \Theta \to [0, \infty)$ such that $\langle\nabla w(\theta), h(\theta)\rangle \le 0$, where $\nabla w$ denotes the
gradient of $w$ with respect to $\theta$ and, for $u, v \in \mathbb{R}^n$, $\langle u, v\rangle$ is their Euclidean
inner product (we will later on also use the notation $|v| = \sqrt{\langle v, v\rangle}$ to denote
the Euclidean norm of $v$). This indeed shows that the noiseless sequence
$\{w(\bar\theta_k)\}$ eventually decreases, showing that $\lim_{k\to\infty} w(\bar\theta_k)$ exists. It should
therefore not be surprising if such a Lyapunov function can play an important
role in showing the stability of the noisy sequence $\{\theta_k\}$. With this in
mind, we can now detail the conditions required to prove our convergence
result:

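(A4) $\Theta$ is an open subset of $\mathbb{R}^{n_\theta}$. The mean field $h : \Theta \to \mathbb{R}^{n_\theta}$ is continuous,
and there exists a continuously differentiable function $w : \Theta \to [0, \infty)$
[with the convention $w(\theta) = \infty$ when $\theta \notin \Theta$] such that:

(i) For any $M > 0$, the level set $\mathcal{W}_M \stackrel{\mathrm{def}}{=} \{\theta \in \Theta, w(\theta) \le M\} \subset \Theta$ is
compact;

(ii) the set of stationary point(s) $\mathcal{L} \stackrel{\mathrm{def}}{=} \{\theta \in \Theta, \langle\nabla w(\theta), h(\theta)\rangle = 0\}$
belongs to the interior of $\Theta$;

(iii) for any $\theta \in \Theta$, $\langle\nabla w(\theta), h(\theta)\rangle \le 0$ and the closure of $w(\mathcal{L})$ has
an empty interior.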

Finally we require some conditions on the sequence of stepsizes $\gamma = \{\gamma_k\}$.

(A5) The sequence $\gamma = \{\gamma_k\}$ is nonincreasing, positive and

\[
\sum_{k=1}^{\infty} \gamma_k = \infty \quad\text{and}\quad \sum_{k=1}^{\infty} \{\gamma_k^2 + k^{-1/2}\gamma_k\} < \infty.
\]

The following theorem is a straightforward simplification of [3], Theorems
5.4 and 5.5, and shows that the tail probability of the number of reinitializations
decreases faster than any exponential, and that the parameter sequence
$\{\theta_k\}$ converges to the stationary set $\mathcal{L}$. For a point $x$ and a set $A$ we define

\[
d(x, A) \stackrel{\mathrm{def}}{=} \inf\{|x - y| : y \in A\}.
\]

Theorem 11. Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$ and let $\gamma = \{\gamma_k\}$ be a real-valued sequence. Consider the time-homogeneous Markov chain
$\{Z_k\}$ on $\mathsf{Z}$ with transition probability $R$, as defined in Section 2. Assume
(A1)-(A5). Then,

\[
\limsup_{k\to\infty} k^{-1} \log\Bigl(\sup_{(x,\theta) \in \mathsf{X} \times \Theta} \mathbb{P}_{x,\theta}\Bigl[\sup_{n \ge 0} \kappa_n \ge k\Bigr]\Bigr) = -\infty,
\]

\[
\inf_{(x,\theta) \in \mathsf{X} \times \Theta} \mathbb{P}_{x,\theta}\Bigl[\lim_{k\to\infty} d(\theta_k, \mathcal{L}) = 0\Bigr] = 1.
\]
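
As a worked illustration of the stepsize condition (A5), which asks for a nonincreasing positive sequence with $\sum_k \gamma_k = \infty$ and $\sum_k \{\gamma_k^2 + k^{-1/2}\gamma_k\} < \infty$, consider polynomial stepsizes. The choice $\gamma_k = k^{-a}$ is an assumption made here for the example; it is a common choice in stochastic approximation and is not prescribed by the text.

```latex
% Illustration only: polynomial stepsizes gamma_k = k^{-a}, a > 0.
\[
\sum_{k\ge 1} k^{-a} = \infty \iff a \le 1,
\qquad
\sum_{k\ge 1} \bigl\{k^{-2a} + k^{-1/2-a}\bigr\} < \infty \iff a > \tfrac12 ,
\]
% Hence (A5) holds for gamma_k = k^{-a} exactly when 1/2 < a <= 1.
% For comparison, the summability condition of Theorem 8,
% sum_k k^{-1} gamma_k < infinity, holds for every a > 0, while that of
% Theorem 9, sum_k k^{-1/2} gamma_k < infinity, requires a > 1/2.
```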



6. Consistency and invariance principle for the adaptive N-SRW kernel. 

In this section we show how our results can be applied to the adaptive
N-SRWM algorithm proposed by Haario, Saksman and Tamminen [19] and
described in Section 1. We first illustrate how the conditions required to
prove the LLN in [19] can be alleviated. In particular, no boundedness condition
is required on the parameter set $\Theta$, but rather conditions on the tails
of the target distribution $\pi$. We then extend these results further and prove
a central limit theorem (Theorem 15).

In view of the results proved above it is required:

(a) to prove the ergodicity and regularity conditions for the Markov kernels
outlined in assumption (A1);

(b) to prove that the reinitializations occur finitely many times (stability)
and that $\{\theta_k\}$ eventually converges. Note again that the convergence
property is only required for the CLT.

We first focus on (a). The geometric ergodicity of the SRWM kernel has been
studied by Roberts and Tweedie [31] and refined by Jarner and Hansen [22];
the regularity of the SRWM kernel has not, to the best of our knowledge,
been considered in the literature. The geometric ergodicity of the SRWM
kernel mainly depends on the tail properties of the target distribution $\pi$.
We will therefore restrict our discussion to target distributions that satisfy
the following set of conditions. These are not minimal, but easy to check in
practice (see [22] for details).

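(M) The probability density $\pi$ is defined on $\mathsf{X} = \mathbb{R}^{n_x}$ for some integer $n_x$
and has the following properties:

(i) It is bounded, bounded away from zero on every compact set
and continuously differentiable.

(ii) It is super-exponential, that is,

\[
\lim_{|x|\to+\infty} \Bigl\langle \frac{x}{|x|}, \nabla\log\pi(x) \Bigr\rangle = -\infty.
\]

(iii) The contours $\partial A(x) = \{y : \pi(y) = \pi(x)\}$ are asymptotically regular,
that is,

\[
\limsup_{|x|\to+\infty} \Bigl\langle \frac{x}{|x|}, \frac{\nabla\pi(x)}{|\nabla\pi(x)|} \Bigr\rangle < 0.
\]

We now establish uniform minorization and drift conditions for the SRWM
algorithm defined in (3). Let $\mathcal{M}(\mathsf{X})$ denote the set of probability densities
w.r.t. the Lebesgue measure $\lambda^{\mathrm{Leb}}$. For any $a, b > 0$, define $\mathcal{Q}_{a,b}(\mathsf{X}) \subset \mathcal{M}(\mathsf{X})$
as follows:

\[
\mathcal{Q}_{a,b}(\mathsf{X}) \stackrel{\mathrm{def}}{=} \Bigl\{q \in \mathcal{M}(\mathsf{X}),\ q(x) = q(-x) \text{ and } \inf_{|x| \le a} q(x) \ge b\Bigr\}. \tag{47}
\]
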
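Proposition 12. Assume (M). For any $\eta \in (0, 1)$, set $V = \pi^{-\eta}/(\sup_{x\in\mathsf{X}} \pi(x))^{-\eta}$. Then:

1. For any nonempty compact set $C \subset \mathsf{X}$, there exists $a > 0$ such that, for any
$b > 0$ such that $\mathcal{Q}_{a,b}(\mathsf{X}) \ne \emptyset$, there exists $\epsilon > 0$ such that $C$ is a $(1, \epsilon)$-small
set for the elements of $\{P_q^{\mathrm{SRW}} : q \in \mathcal{Q}_{a,b}(\mathsf{X})\}$, with minorization probability
distribution $\varphi$ such that, for any $A \in \mathcal{B}(\mathsf{X})$, $\varphi(A) = \lambda^{\mathrm{Leb}}(A \cap C)/\lambda^{\mathrm{Leb}}(C)$,
that is,

\[
\inf_{q \in \mathcal{Q}_{a,b}(\mathsf{X})} P_q^{\mathrm{SRW}}(x, A) \ge \epsilon\varphi(A) \quad\text{for all } x \in C \text{ and } A \in \mathcal{B}(\mathsf{X}). \tag{48}
\]

2. Furthermore, for any $a > 0$ and $b > 0$ such that $\mathcal{Q}_{a,b}(\mathsf{X}) \ne \emptyset$,

\[
\sup_{q \in \mathcal{Q}_{a,b}(\mathsf{X})} \limsup_{|x|\to+\infty} \frac{P_q^{\mathrm{SRW}}V(x)}{V(x)} < 1, \tag{49}
\]

\[
\sup_{(x,q) \in \mathsf{X} \times \mathcal{Q}_{a,b}(\mathsf{X})} \frac{P_q^{\mathrm{SRW}}V(x)}{V(x)} < +\infty. \tag{50}
\]

3. Let $q, q' \in \mathcal{M}(\mathsf{X})$ be two symmetric probability distributions. Then, for
any $r \in [0, 1]$ and any $f \in \mathcal{L}_{V^r}$, we have

\[
\|P_q^{\mathrm{SRW}}f - P_{q'}^{\mathrm{SRW}}f\|_{V^r} \le 2\|f\|_{V^r} \int_{\mathsf{X}} |q(x) - q'(x)|\lambda^{\mathrm{Leb}}(dx). \tag{51}
\]

The proof appears in Appendix C.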

As an example of an application, one can again consider the adaptive
N-SRWM introduced earlier in Section 1, where the proposal distribution is
$\mathcal{N}(0, \Gamma)$. In the following lemma, we show that the mapping $\Gamma \mapsto P^{\mathrm{SRW}}_{\mathcal{N}(0,\Gamma)}f$ is
Lipschitz continuous. This result can be generalized to distributions in the
curved exponential family (see Proposition 16).

Lemma 13. Let $\mathcal{K}$ be a convex compact subset of $\mathcal{C}_+^{n_x}$ and set $V = \pi^{-\eta}/(\sup_x \pi)^{-\eta}$ for some $\eta \in (0, 1)$. For any $r \in [0, 1]$, any $(\Gamma, \Gamma') \in \mathcal{K} \times \mathcal{K}$,
and any $f \in \mathcal{L}_{V^r}$, we have

\[
\|P^{\mathrm{SRW}}_{\mathcal{N}(0,\Gamma)}f - P^{\mathrm{SRW}}_{\mathcal{N}(0,\Gamma')}f\|_{V^r} \le \frac{2n_x}{\lambda_{\min}(\mathcal{K})}\|f\|_{V^r}|\Gamma - \Gamma'|,
\]

where, for $\Gamma \in \mathcal{C}_+^{n_x}$, $|\Gamma|^2 = \mathrm{Tr}[\Gamma\Gamma^{T}]$ and $\lambda_{\min}(\mathcal{K})$ is the minimum possible
eigenvalue for matrices in $\mathcal{K}$.

The proof appears in Appendix D. We now turn to proving that the
stochastic approximation procedure outlined by Haario, Saksman and Tamminen
[19] is ultimately pathwise bounded and eventually converges. In the
case of the algorithm proposed by Haario, Saksman and Tamminen [19],
the parameter estimates $\mu_k$ and $\Gamma_k$ take the form of maximum likelihood
estimates under the i.i.d. multivariate Gaussian model. It therefore comes
as no surprise if the Lyapunov function $w$ required to check (A4) is the
Kullback-Leibler divergence between the target density $\pi$ and the normal
density $\mathcal{N}(\mu, \Gamma)$,

\[
w(\mu, \Gamma) = \log\det\Gamma + (\mu - \mu_\pi)^{T}\Gamma^{-1}(\mu - \mu_\pi) + \mathrm{Tr}(\Gamma^{-1}\Gamma_\pi), \tag{52}
\]

where $\mu_\pi$ and $\Gamma_\pi$ are the mean and covariance of the target distribution,
defined in (9). Using straightforward algebra and the definition (8) of the
mean field $h$, one can check that

\[
\begin{aligned}
\langle\nabla w(\mu, \Gamma), h(\mu, \Gamma)\rangle
= {} & -2(\mu - \mu_\pi)^{T}\Gamma^{-1}(\mu - \mu_\pi) \\
& - \mathrm{Tr}\{\Gamma^{-1}(\Gamma - \Gamma_\pi)\Gamma^{-1}(\Gamma - \Gamma_\pi)\} - \bigl((\mu - \mu_\pi)^{T}\Gamma^{-1}(\mu - \mu_\pi)\bigr)^2,
\end{aligned} \tag{53}
\]

that is, $\langle\nabla w(\theta), h(\theta)\rangle \le 0$ for any $\theta \stackrel{\mathrm{def}}{=} (\mu, \Gamma) \in \Theta$, with equality if and only
if $\Gamma = \Gamma_\pi$ and $\mu = \mu_\pi$. The situation in this case is simple, as the set of
stationary points $\{\theta \in \Theta, h(\theta) = 0\}$ is reduced to a single point, and the
Lyapunov function $w$ goes to infinity as $|\mu| \to \infty$ or $\Gamma$ goes to the boundary
of the cone of positive matrices.

It can now be shown that these results lead to the following intermediate 
lemma; see [3] for details. 

Lemma 14. Assume (M) and let $H$ and $h$ be as in (5) and (8). Then,
(A3) and (A4) are satisfied with $V = \pi^{-1}/(\sup_x \pi)^{-1}$ for any $\beta \in (0, 1/2]$
and $w$ as in (52). In addition, the set of stationary points $\mathcal{L} \stackrel{\mathrm{def}}{=} \{\theta \in \Theta \stackrel{\mathrm{def}}{=} \mathbb{R}^{n_x} \times \mathcal{C}_+^{n_x}, \langle\nabla w(\theta), h(\theta)\rangle = 0\}$ is reduced to a single point $\theta_\pi = (\mu_\pi, \Gamma_\pi)$,
whose components are respectively the mean and covariance of the distribution $\pi$.

From Proposition 12 and Lemma 13, we deduce our main theorem for 
this section, concerned with the adaptive N-SRWM of [19] as described in 
Section 1, but with reprojections as in Section 2. 

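Theorem 15. Consider the process $\{Z_k\}$ with $\{P_\theta, \theta = (\mu, \Gamma) \in \Theta \stackrel{\mathrm{def}}{=} \mathbb{R}^{n_x} \times \mathcal{C}_+^{n_x}, q_\theta = \mathcal{N}(0, \lambda\Gamma), \lambda > 0\}$ as in (3), $\{H_\theta, \theta \in \Theta\}$ as in equation (5),
$\pi$ satisfying (M), $\gamma = \{\gamma_k, k \ge 0\}$ satisfying (A5) and $\mathsf{K}$ a compact set. Let
$W \stackrel{\mathrm{def}}{=} \pi^{-1}/(\sup\pi)^{-1}$. Then, for any $\alpha \in [0, 1)$:

1. For any $f \in \mathcal{L}(W^\alpha)$ a strong LLN holds, that is,

\[
n^{-1} \sum_{k=1}^{n} \Bigl(f(X_k) - \int_{\mathsf{X}} f(x)\pi(dx)\Bigr) \to 0, \qquad \mathbb{P}_\star\text{-a.s.} \tag{54}
\]

2. For any $f \in \mathcal{L}(W^{\alpha/2})$ a CLT holds, that is,

\[
\frac{1}{\sqrt{n}} \sum_{k=1}^{n} [f(X_k) - \pi(f)] \to_{\mathcal{D}} \mathcal{N}(0, \sigma^2(\theta_\pi, f)),
\]

and, if $\sigma(\theta_\pi, f) > 0$,

\[
\frac{1}{\sqrt{n}\,\sigma(\theta_n, f)} \sum_{k=1}^{n} (f(X_k) - \pi(f)) \to_{\mathcal{D}} \mathcal{N}(0, 1),
\]

where $\theta_n = (\mu_n, \Gamma_n)$ and $\sigma^2(\theta_n, f)$ are defined in (38).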

The proof is immediate. We refer the reader to [19] for applications of 
this type of algorithm to various settings. 

7. Application: matching $\pi$ with mixtures.

7.1. Setup. The independence Metropolis-Hastings algorithm (IMH) corresponds
to the case where the proposal distribution used in an MH transition
probability does not depend on the current state of the MCMC chain,
that is, $q(x, y) = q(y)$ for some density $q \in \mathcal{M}(\mathsf{X})$. The transition kernel of
the Metropolis algorithm is then given for $x \in \mathsf{X}$ and $A \in \mathcal{B}(\mathsf{X})$ by

\[
P_q^{\mathrm{IMH}}(x, A) = \int_A \alpha_q(x, y)q(y)\lambda^{\mathrm{Leb}}(dy) + \mathbb{1}_A(x) \int_{\mathsf{X}} (1 - \alpha_q(x, y))q(y)\lambda^{\mathrm{Leb}}(dy) \tag{55}
\]

with $\alpha_q(x, y) = 1 \wedge \dfrac{\pi(y)q(x)}{\pi(x)q(y)}$.
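
A minimal sketch of one way to simulate the IMH kernel (55) is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `imh_sample`, the standard normal target and the $\mathcal{N}(0, 4)$ proposal are all hypothetical choices made for the example. Only the ratio $\pi/q$ enters the acceptance probability $\alpha_q$, so both densities may be unnormalized. For this particular pair, $q(x)/\pi(x) = \tfrac12 e^{3x^2/8} \ge \tfrac12$.

```python
import math
import random


def imh_sample(log_target, log_proposal, draw_proposal, x0, n_steps, rng):
    """Independence Metropolis-Hastings chain (illustrative helper, not from the paper).

    A proposal y ~ q is accepted with probability 1 ^ [pi(y) q(x)] / [pi(x) q(y)],
    i.e. 1 ^ w(y)/w(x) with importance weight w = pi / q.
    """
    x = x0
    log_w_x = log_target(x) - log_proposal(x)
    chain, n_accept = [], 0
    for _ in range(n_steps):
        y = draw_proposal(rng)  # the proposal ignores the current state x
        log_w_y = log_target(y) - log_proposal(y)
        if log_w_y >= log_w_x or rng.random() < math.exp(log_w_y - log_w_x):
            x, log_w_x = y, log_w_y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_steps


# Assumed example: target N(0, 1), proposal N(0, 2^2), both up to constants.
log_target = lambda x: -0.5 * x * x
log_proposal = lambda x: -0.125 * x * x
draw_proposal = lambda rng: rng.gauss(0.0, 2.0)

rng = random.Random(0)
chain, acc_rate = imh_sample(log_target, log_proposal, draw_proposal, 0.0, 20000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(f"acceptance rate={acc_rate:.3f}  mean={mean:.3f}  variance={var:.3f}")
```

The ergodic averages of the chain should be close to the mean 0 and variance 1 of the assumed target.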

Irreducibility of Markov chains built on this model naturally requires that
$q(x) > 0$ whenever $\pi(x) > 0$. In fact, the performance of the IMH is known to
depend on how well the proposal distribution mimics the target distribution,
and this can be quantified in several ways. For example, it has been shown
in [26] that the IMH sampler is geometrically ergodic if and only if there
exists $\epsilon > 0$ such that $q \in \mathcal{Q}_{\epsilon,\pi} \subset \mathcal{M}(\mathsf{X})$, where

\[
\mathcal{Q}_{\epsilon,\pi} = \{q \in \mathcal{M}(\mathsf{X}) : \lambda^{\mathrm{Leb}}(\{x \in \mathsf{X} : q(x)/\pi(x) < \epsilon\}) = 0\}. \tag{56}
\]

This condition implies that the whole state space $\mathsf{X}$ is a $(1, \epsilon)$-small set,
which in turn implies that convergence occurs uniformly, at a geometric
rate bounded above by $1 - \epsilon$. Given a family of candidate proposal distributions
$\{q_\theta \in \mathcal{M}(\mathsf{X}), \theta \in \Theta\}$, it therefore seems natural to maximise $\theta \mapsto \inf_{x\in\mathsf{X}} q_\theta(x)/\pi(x)$. However, although theoretically attractive, the optimization
of this uniform criterion might be a very ambitious task in practice.
Furthermore, it might not necessarily be a good choice for a given parametric
family of proposal distributions: one might in this case try to optimize
the transition probability for pathological features of $\pi$ with small probability
under $\pi$, at the expense of more fundamental characteristics of the
target, such as its global shape. Additionally, such pathological features can
very often be taken care of by other specialized MCMC updates. Instead
of this uniform criterion, we suggest the optimization of an average property
of the ratio $\pi(x)/q_\theta(x)$ under $\pi$, which possesses the advantage of being
more amenable to computation. It is argued in [15] that minimizing the total
variation distance $\|\pi - q_\theta\|_{TV}$ is a sensible criterion to optimize, since it can
be proved that the expected acceptance probability is bounded below by
$1 - \|\pi - q_\theta\|_{TV}$, and that, for a bounded function $f$, the first covariance coefficient
of the Markov chain in the stationary regime is bounded as follows:
$\mathrm{cov}_\pi(f(X_k), f(X_{k+1})) \le (5/2)^2 \sup_{x\in\mathsf{X}} |f| \, \|\pi - q_\theta\|_{TV}$. However, no systematic
way of effectively minimizing this criterion is described. We propose
here to use the Kullback-Leibler divergence between the target distribution
$\pi$ and an auxiliary distribution $\tilde q_\theta$ close in some sense to $q_\theta$,

\[
K(\pi \| \tilde q_\theta) = \int_{\mathsf{X}} \pi(x) \log\frac{\pi(x)}{\tilde q_\theta(x)} \lambda^{\mathrm{Leb}}(dx). \tag{57}
\]

The proposal distribution $q_\theta$ of the IMH algorithm is then constructed from
$\tilde q_\theta$. As we shall see, this offers an additional degree of freedom which, in particular,
will be a simple way of ensuring that $\{q_\theta, \theta \in \Theta\} \subset \mathcal{Q}_{\epsilon,\pi}$, defined in
(56), for some $\epsilon > 0$ (see Remark 7). The use of this criterion possesses several
advantages. First, invoking Pinsker's inequality, it is possible to repeat
Gåsemyr's [15] arguments. Second, it formalizes several ideas that have been
proposed in the literature (cf. [15] and [17] among others). In [17] it is suggested
to use the EM (Expectation-Maximization) algorithm in order to fit a
mixture of normals in the, possibly penalized, maximum likelihood sense to
samples from a preliminary run of an MCMC algorithm. This mixture can
then be used to define the proposal distribution of an IMH. As we shall see,
the choice of the Kullback-Leibler (KL) divergence corresponds precisely
to this choice and naturally leads to an on-line EM algorithm that allows
us to adjust $\tilde q_\theta$ to $\pi$ as samples from $\pi$ become available from the MCMC
sampler. Finally, we point out at this stage that, although we restrict here
our discussion to the IMH algorithm, the KL criterion can equally be used
for other updates, such as the SRWM algorithm. The algorithm proposed
by Haario, Saksman and Tamminen [19] is in this case a particular instance
of the algorithm hereafter.

In order to allow for flexibility and the description of a general class of
algorithms, we consider here mixtures of distributions in the exponential
family for the auxiliary proposal distribution. More precisely, let $\Xi \subset \mathbb{R}^{n_\xi}$
and $\mathsf{Z} \subset \mathbb{R}^{n_z}$, for some integers $n_\xi$ and $n_z$, and define the following family
of exponential probability densities (defined with respect to the product
measure $\lambda^{\mathrm{Leb}} \otimes \mu$ for some measure $\mu$ on $\mathsf{Z}$)

\[
\mathcal{E}_c = \{f_\xi : f_\xi(x, z) = \exp\{-\psi(\xi) + \langle T(x, z), \phi(\xi)\rangle\};\ (\xi, x, z) \in \Xi \times \mathsf{X} \times \mathsf{Z}\},
\]

where $\psi : \Xi \to \mathbb{R}$, $\phi : \Xi \to \mathbb{R}^{n_\theta}$ and $T : \mathsf{X} \times \mathsf{Z} \to \mathbb{R}^{n_\theta}$. Let $\mathcal{E}$ denote the set of
densities $\tilde q_\xi$ that are marginals of densities from $\mathcal{E}_c$, that is, such that for
any $(\xi, x) \in \Xi \times \mathsf{X}$ we have

\[
\tilde q_\xi(x) = \int_{\mathsf{Z}} f_\xi(x, z)\mu(dz). \tag{58}
\]

This family of densities covers in particular finite mixtures of multivariate
normal distributions. Here, the variable $z$ plays the role of the label of the
class, which is not observed (see, e.g., [32]). Using standard missing data
terminology, $f_\xi(x, z)$ is the complete data likelihood and $\tilde q_\xi$ is the associated
incomplete data likelihood, which is the marginal of the complete data
likelihood with respect to the class labels. When the number of observations
is fixed, a classical approach to estimating the parameters of a mixture
distribution consists of using the EM algorithm.

7.2. Classical EM algorithm. The classical EM algorithm is an iterative
procedure which consists of two steps. Given $n$ independent samples
$(X_1, \ldots, X_n)$ distributed marginally according to $\pi$: (1) Expectation step:
calculate the conditional expectation of the complete data log-likelihood,
given the observations and $\xi_k$ (the estimate of $\xi$ at iteration $k$)

\[
\xi \mapsto Q(\xi, \xi_k) = \sum_{i=1}^{n} \mathbb{E}\{\log(f_\xi(X_i, Z_i)) \mid X_i, \xi_k\}.
\]

(2) Maximization step: maximize the function $\xi \mapsto Q(\xi, \xi_k)$ with respect to $\xi$.
The new estimate for $\xi$ is $\xi_{k+1} = \arg\max_{\xi\in\Xi} Q(\xi, \xi_k)$ (provided that it exists
and is unique). The key property at the core of the EM algorithm is that
the incomplete data likelihood is increased at each iteration, that is,
$\prod_{i=1}^{n} \tilde q_{\xi_{k+1}}(X_i) \ge \prod_{i=1}^{n} \tilde q_{\xi_k}(X_i)$, with equality if and only if $\xi_k$ is a stationary point (i.e., a local
or global minimum or a saddle point). Under mild additional conditions
(see, e.g., [33]), the EM algorithm therefore converges to stationary points
of the marginal likelihood. Note that, when $n \to \infty$, under appropriate conditions,
the renormalized incomplete data log-likelihood $n^{-1}\sum_{i=1}^{n} \log\tilde q_\xi(X_i)$
converges to $\mathbb{E}_\pi[\log\tilde q_\xi(X)]$, which is equal, up to a constant and a sign, to
the Kullback-Leibler divergence between $\pi$ and $\tilde q_\xi$. In our particular setting,
the classical batch form of the algorithm is as follows: first define for $\xi \in \Xi$
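the conditional distribution

\[
\nu_\xi(x, z) \stackrel{\mathrm{def}}{=} \frac{f_\xi(x, z)}{\tilde q_\xi(x)}, \tag{59}
\]

where $\tilde q_\xi$ is given by (58). Now, assuming that $\int_{\mathsf{Z}} |T(x, z)|\nu_\xi(x, z)\mu(dz) < \infty$,
one can define for $x \in \mathsf{X}$ and $\xi \in \Xi$

\[
\nu_\xi T(x) \stackrel{\mathrm{def}}{=} \int_{\mathsf{Z}} T(x, z)\nu_\xi(x, z)\mu(dz), \tag{60}
\]

and check that for $f_\xi \in \mathcal{E}_c$ and any $(\xi, \xi') \in \Xi \times \Xi$ that

\[
\mathbb{E}\{\log(f_\xi(X_i, Z_i)) \mid X_i, \xi'\} = L(\nu_{\xi'}T(X_i); \xi),
\]

where $L : \Theta \times \Xi \to \mathbb{R}$ is defined as

\[
L(\theta; \xi) \stackrel{\mathrm{def}}{=} -\psi(\xi) + \langle\theta, \phi(\xi)\rangle \quad\text{with}\quad \Theta \stackrel{\mathrm{def}}{=} T(\mathsf{X}, \mathsf{Z}). \tag{61}
\]

From this, one easily deduces that, for $n$ samples,

\[
Q(\xi, \xi_k) = nL\Bigl(\frac1n \sum_{i=1}^{n} \nu_{\xi_k}T(X_i); \xi\Bigr).
\]

Assuming now for simplicity that, for all $\theta \in \Theta$, the function $\xi \mapsto L(\theta; \xi)$
reaches its maximum at a single point denoted by $\hat\xi(\theta)$ [i.e., $L(\theta; \hat\xi(\theta)) \ge L(\theta; \xi)$ for all $\xi \in \Xi$], the EM recursion can then be simply written as

\[
\xi_{k+1} = \hat\xi\Bigl(\frac1n \sum_{i=1}^{n} \nu_{\xi_k}T(X_i)\Bigr).
\]

The condition on the existence and uniqueness of $\hat\xi(\theta)$ is not restrictive. It
is, for example, satisfied for finite mixtures of normal distributions. More
sophisticated generalizations of the EM algorithm have been developed in
order to deal with situations where this condition is not satisfied; see, for
example, [25].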
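
The batch recursion $\xi_{k+1} = \hat\xi(n^{-1}\sum_i \nu_{\xi_k}T(X_i))$ can be sketched on a toy model. Everything specific below is an assumption made for illustration: a two-component mixture of unit-variance normals on the real line, with $\xi$ = (weights, means), sufficient statistic $T(x, z) = (\mathbb{1}\{z = j\}, \mathbb{1}\{z = j\}x)_j$, and hypothetical helper names. The term $-x^2/2$ of the complete log-likelihood does not depend on $\xi$ and is left out of the statistics.

```python
import math
import random


def responsibilities(x, weights, means):
    """Posterior class probabilities nu_xi(x, z) for a unit-variance normal mixture."""
    logs = [math.log(w) - 0.5 * (x - m) ** 2 for w, m in zip(weights, means)]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def xi_hat(theta):
    """Maximizer of xi -> L(theta; xi) for this toy model: weights = s0, means = s1 / s0."""
    s0, s1 = theta
    return list(s0), [b / a for a, b in zip(s0, s1)]


def em_step(data, weights, means):
    """One batch EM iteration: average nu_xi T(X_i) over the sample, then apply xi_hat."""
    n, k = len(data), len(weights)
    s0, s1 = [0.0] * k, [0.0] * k
    for x in data:
        r = responsibilities(x, weights, means)
        for j in range(k):
            s0[j] += r[j] / n
            s1[j] += r[j] * x / n
    return xi_hat((s0, s1))


def log_likelihood(data, weights, means):
    """Incomplete data log-likelihood sum_i log q_xi(X_i)."""
    c = -0.5 * math.log(2.0 * math.pi)
    return sum(
        math.log(sum(w * math.exp(c - 0.5 * (x - m) ** 2) for w, m in zip(weights, means)))
        for x in data
    )


rng = random.Random(1)
data = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0) for _ in range(500)]
weights, means = [0.5, 0.5], [-1.0, 1.0]
lls = [log_likelihood(data, weights, means)]
for _ in range(20):
    weights, means = em_step(data, weights, means)
    lls.append(log_likelihood(data, weights, means))
print(weights, means, lls[0], lls[-1])
```

The list `lls` records the incomplete data log-likelihood after each iteration, so the monotonicity property stated above can be inspected directly.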

Our scenario differs from the classical setup above in two respects. First,
the number of samples considered evolves with time, which requires that we
estimate $\xi$ on the fly. Second, the samples $\{X_i\}$ are generated by a transition
probability with invariant distribution $\pi$ and are therefore not independent.
We address the first problem in Section 7.3 and the two problems simultaneously
in Section 7.4 where we describe our particular adaptive MCMC
algorithm.

7.3. Sequential EM algorithm. Sequential implementations of the EM
algorithm for estimating the parameters of a mixture when the data are
observed sequentially in time have been considered by several authors (see
[32], Chapter 6, [5] and the references therein). The version presented here
is, in many respects, a standard adaptation of these algorithms and consists
of recursively and jointly estimating and maximizing with respect to $\xi$ the
function

9(0=^[logq i (X)]=ir{vT i (X)}, 

which, as pointed out earlier, is the Kullback-Leibler divergence between $\pi$
and $\tilde q_\xi$, up to an additive constant and a sign. At iteration $k + 1$, given an
estimate $\theta_k$ of $\theta$ and $\xi_k = \hat\xi(\theta_k)$, sample $X_{k+1} \sim \pi$ and calculate

\[
\begin{aligned}
\theta_{k+1} &= (1 - \gamma_{k+1})\theta_k + \gamma_{k+1}\nu_{\xi_k}T(X_{k+1}) \\
&= \theta_k + \gamma_{k+1}(\nu_{\xi_k}T(X_{k+1}) - \theta_k),
\end{aligned} \tag{62}
\]

where $\{\gamma_k\}$ is a sequence of stepsizes and $\gamma_k \in [0, 1]$. This can be interpreted
as a stochastic approximation algorithm $\theta_{k+1} = \theta_k + \gamma_{k+1}H(\theta_k, X_{k+1})$ with,
for $\theta \in \Theta$,

\[
H(\theta, x) = \nu_{\hat\xi(\theta)}T(x) - \theta \quad\text{and}\quad h(\theta) = \pi(\nu_{\hat\xi(\theta)}T) - \theta. \tag{63}
\]

At this stage, it is possible to introduce a set of simple conditions on
the distributions in $\mathcal{E}_c$ that ensures the convergence of the sequence $\{\theta_k\}$
defined above. By convergence, we mean here that $\{\theta_k\}$ converges to the set
of stationary points of the Kullback-Leibler divergence between $\pi$ and $\tilde q_{\hat\xi(\theta)}$,
that is,

\[
\mathcal{L} = \{\theta \in \Theta : \nabla w(\theta) = 0\},
\]

where, for $\theta \in \Theta$,

\[
w(\theta) = K(\pi \| \tilde q_{\hat\xi(\theta)}), \tag{64}
\]

and $K$ and $\tilde q_\xi$ are defined in (57) and (58), respectively. It is worth noticing
that these very same conditions will be used to prove the convergence of our
adaptive MCMC algorithm:
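
The recursion $\theta_{k+1} = \theta_k + \gamma_{k+1}(\nu_{\xi_k}T(X_{k+1}) - \theta_k)$ with $\xi_k = \hat\xi(\theta_k)$ can be sketched as follows for i.i.d. draws from the target, which is the setting of this subsection. The toy model is an assumption made for illustration and is not taken from the paper: a two-component unit-variance normal mixture with $T(x, z) = (\mathbb{1}\{z = j\}, \mathbb{1}\{z = j\}x)_j$, the stepsizes $\gamma_k = (k + 10)^{-0.7}$, and all helper names.

```python
import math
import random


def responsibilities(x, weights, means):
    """Posterior class probabilities nu_xi(x, z) for a unit-variance normal mixture."""
    logs = [math.log(w) - 0.5 * (x - m) ** 2 for w, m in zip(weights, means)]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def xi_hat(theta):
    """Map averaged statistics theta = (s0, s1) to parameters: weights = s0, means = s1 / s0."""
    s0, s1 = theta
    return list(s0), [b / a for a, b in zip(s0, s1)]


def online_em(stream, theta0, step):
    """Stochastic approximation form of EM: one update of theta per observation."""
    s0, s1 = list(theta0[0]), list(theta0[1])
    for k, x in enumerate(stream, start=1):
        weights, means = xi_hat((s0, s1))        # xi_{k-1} = xi_hat(theta_{k-1})
        r = responsibilities(x, weights, means)  # first block of nu_xi T(x)
        g = step(k)
        # theta_k = theta_{k-1} + gamma_k * (nu_xi T(X_k) - theta_{k-1})
        s0 = [a + g * (rj - a) for a, rj in zip(s0, r)]
        s1 = [b + g * (rj * x - b) for b, rj in zip(s1, r)]
    return s0, s1


rng = random.Random(2)
stream = [rng.gauss(-2.0, 1.0) if rng.random() < 0.3 else rng.gauss(2.0, 1.0) for _ in range(20000)]
theta0 = ([0.5, 0.5], [-0.5, 0.5])      # corresponds to weights (0.5, 0.5), means (-1, 1)
step = lambda k: (k + 10) ** -0.7       # assumed stepsizes: nonincreasing, in (0, 1)
theta = online_em(stream, theta0, step)
weights, means = xi_hat(theta)
print(weights, means)
```

Because each update is a convex combination of the previous statistics and the new conditional expectation, the first block of $\theta_k$ remains a probability vector throughout.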

(E1) (i) The sets $\Xi$ and $\Theta$ are open subsets of $\mathbb{R}^{n_\xi}$ and $\mathbb{R}^{n_\theta}$, respectively.
$\mathsf{Z}$ is a compact subset of $\mathbb{R}^{n_z}$.

(ii) For any $x \in \mathsf{X}$, $T(x) \stackrel{\mathrm{def}}{=} \inf\{M : \mu(\{z : |T(x, z)| \ge M\}) = 0\} < \infty$.

(iii) The functions $\psi : \Xi \to \mathbb{R}$ and $\phi : \Xi \to \mathbb{R}^{n_\theta}$ are twice continuously
differentiable on $\Xi$.

(iv) There exists a continuously differentiable function $\hat\xi : \Theta \to \Xi$
such that, for all $(\theta, \xi) \in \Theta \times \Xi$,

\[
L(\theta; \hat\xi(\theta)) \ge L(\theta; \xi).
\]

Remark 5. For many models, the function $\xi \mapsto L(\theta; \xi)$ admits a unique
global maximum for any $\theta \in \Theta$, and the existence and differentiability of
$\theta \mapsto \hat\xi(\theta)$ follows from the implicit function theorem under mild regularity
conditions.

(E2) (i) The level sets $\{\theta \in \Theta, w(\theta) \le M\}$ for $M > 0$ are compact.

(ii) The set $\mathcal{L} \stackrel{\mathrm{def}}{=} \{\theta \in \Theta, \nabla w(\theta) = 0\}$ of stationary points is included
in a compact subset of $\Theta$.

(iii) The closure of $w(\mathcal{L})$ has an empty interior.

Remark 6. Assumption (E2) depends on both the properties of $\pi$ and
$\tilde q_\xi$ and should therefore be checked on a case-by-case basis. Note, however,
that (a) these assumptions are satisfied for finite mixtures of distributions
in the exponential family under classical technical conditions on the
parametrization beyond the scope of the present paper (see, among others,
[32], Chapter 6, and [5] for details); (b) the third assumption in (E2) can
very often be checked using Sard's theorem.

We first prove here an intermediate proposition concerned with estimates
of the variation $\tilde q_\xi - \tilde q_{\xi'}$ under (E1) in various senses. Note that most of these
results are not used in this section, but will be useful in the following.

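Proposition 16. Let $\{\tilde q_\xi, \xi \in \Xi\} \subset \mathcal{E}$ be a family of distributions satisfying
(E1). Then, for any convex compact set $\mathcal{K} \subset \Xi$:

1. There exists a constant $C < \infty$ such that

\[
\sup_{\xi\in\mathcal{K}} |\nabla_\xi \log\tilde q_\xi(x)| \le C(1 + T(x)). \tag{65}
\]

2. For any $(\xi, \xi', x) \in \mathcal{K}^2 \times \mathsf{X}$ there exists a constant $C < \infty$ such that

\[
|\tilde q_\xi(x) - \tilde q_{\xi'}(x)| \le C|\xi - \xi'|(1 + T(x)) \sup_{\xi\in\mathcal{K}} \tilde q_\xi(x). \tag{66}
\]

3. For $W : \mathsf{X} \to [1, \infty)$ such that $\sup_{\xi\in\mathcal{K}} \int_{\mathsf{X}} \tilde q_\xi(x)[1 + T(x)]W(x)\lambda^{\mathrm{Leb}}(dx) < \infty$
and any $\xi, \xi' \in \mathcal{K}$, there exists a constant $C < \infty$ such that

\[
\int_{\mathsf{X}} |\tilde q_\xi(x) - \tilde q_{\xi'}(x)|W(x)\lambda^{\mathrm{Leb}}(dx) \le C|\xi - \xi'|. \tag{67}
\]
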

The proof appears in Appendix E. The key to establishing the convergence of the stochastic approximation procedure here consists of proving that $w(\theta) = K(\pi \| q_{\hat\xi(\theta)})$ plays the role of a Lyapunov function. This is hardly surprising as the algorithm aims at minimizing sequentially in time the incomplete data likelihood. More precisely, we have:
Proposition 17. Assume (E1). Then, for all $\theta \in \Theta$, $\langle \nabla w(\theta), h(\theta) \rangle \le 0$,

(68) $\mathcal{L} = \{\theta \in \Theta : \langle \nabla w(\theta), h(\theta) \rangle = 0\} = \{\theta \in \Theta : \nabla w(\theta) = 0\}$,

(69) $\hat\xi(\mathcal{L}) = \{\xi \in \Xi : \nabla_\xi K(\pi \| q_\xi) = 0\}$,

where $\theta \mapsto h(\theta)$ is given in (63).

The proof appears in Appendix E. Another important result needed to prove convergence is the regularity of the field $\theta \mapsto H_\theta$. We have:

Proposition 18. Assume (E1). Then $\{H_\theta, \theta \in \Theta\}$ is $(1 + T)^2$-Lipschitz, where $H_\theta$ is defined in (63).

The proof appears in Appendix E. With this, and standard results on the convergence of SA, one may show that the SA procedure converges pointwise under (E1) and (E2).

7.4. On-line EM for IMH adaptation. We now consider the combination of the sequential EM algorithm described earlier with the IMH sampler. As we shall see in Proposition 20, using $q_{\hat\xi(\theta)}$ as a proposal distribution for the IMH transition is not sufficient to ensure the convergence of the algorithm, and it will be necessary to use a mixture of a fixed distribution $\zeta$ (which will not be updated during the successive iterations) and an adaptive component, here $q_{\hat\xi(\theta)}$. More precisely, we define the following family of parametrized IMH transition probabilities $\{P_\theta, \theta \in \Theta\}$: for $\epsilon_\zeta \in (0, 1]$ let $\zeta \in \mathcal{Q}_{\epsilon_\zeta,\pi}$ (assumed nonempty) be a density which does not depend on $\theta \in \Theta$, let $\iota \in (0, 1)$ and define the family of IMH transition probabilities

(70) $\mathcal{D}_{\epsilon_\zeta,\iota} := \{P_\theta := P^{\mathrm{IMH}}_{q_\theta}, \theta \in \Theta\}$ with $\{q_\theta := (1 - \iota) q_{\hat\xi(\theta)} + \iota\zeta, \theta \in \Theta\}$.

The following properties on $\zeta$ and $\mathcal{E}_c$ will be required in order to ensure that $\mathcal{D}_{\epsilon_\zeta,\iota}$ satisfies (E1) and (E2):

(E3) (i) There exist $\epsilon_\zeta > 0$ and $\zeta \in \mathcal{Q}_{\epsilon_\zeta,\pi}$ such that, for any compact $\mathcal{K} \subset \Xi$,

(ii) There exists $W \to [1, \infty)$ such that, for any compact subset $\mathcal{K} \subset \Xi$,

and

(71)

and $\sup_{x \in \mathsf{K}} W(x) < \infty$, where $\mathsf{K}$ is defined in Section 2.

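Remark 7. It is worth pointing out that the above choice for $q_\theta$ and the condition $\zeta \in \mathcal{Q}_{\epsilon_\zeta,\pi}$ automatically ensure that $\{q_\theta, \theta \in \Theta\} \subset \mathcal{Q}_{\epsilon,\pi}$ for $\epsilon = \epsilon_\zeta \iota$.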

The basic version (see Section 2) of our algorithm now proceeds as follows: set $\theta_0 \in \Theta$, $\xi_0 = \hat\xi(\theta_0)$ and draw $X_0$ according to some initial distribution. At iteration $k + 1$ for $k \ge 0$, draw $X_{k+1} \sim P_{\theta_k}(X_k, \cdot)$ where $P_\theta$ is given in (70). Compute $\theta_{k+1} = \theta_k + \gamma_{k+1}(\nu_{\xi_k} T(X_{k+1}) - \theta_k)$ and $\xi_{k+1} = \hat\xi(\theta_{k+1})$. We will study here the corresponding algorithm with reprojections which results in the homogeneous Markov chain $\{Z_k, k \ge 0\}$ as described in Section 2.
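The basic iteration above can be sketched in code for a deliberately simple special case. The following is a minimal illustration and not the algorithm analyzed in this paper: it assumes a one-dimensional target, a single Gaussian adaptive component with no missing data (so that $\nu_\xi T(x)$ reduces to $T(x) = (x, x^2)$ and $\hat\xi$ is plain moment matching), a Gaussian fixed component $\zeta$, step sizes $\gamma_k = 1/(k+1)$, and no reprojection. The names `adaptive_imh`, `xi_hat`, `iota`, `zeta_mean` and `zeta_var` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def xi_hat(theta, min_var=1e-3):
    """Moment matching: map averaged statistics (E[x], E[x^2]) to (mean, variance)."""
    mean = theta[0]
    var = max(theta[1] - mean * mean, min_var)  # guard against a degenerate variance
    return mean, var


def adaptive_imh(log_target, n_iter, iota=0.2, zeta_mean=0.0, zeta_var=25.0, seed=0):
    """Adaptive IMH with proposal q_theta = (1 - iota) * N(xi_hat(theta)) + iota * zeta."""
    rng = random.Random(seed)
    theta = [0.0, 1.0]  # running averages of the sufficient statistics (x, x^2)
    x = 0.0
    samples = []

    def proposal_pdf(y, xi):
        adaptive = normal_pdf(y, xi[0], xi[1])
        fixed = normal_pdf(y, zeta_mean, zeta_var)
        return (1.0 - iota) * adaptive + iota * fixed

    for k in range(1, n_iter + 1):
        xi = xi_hat(theta)
        # Draw a candidate from the mixture proposal.
        if rng.random() < iota:
            y = rng.gauss(zeta_mean, math.sqrt(zeta_var))
        else:
            y = rng.gauss(xi[0], math.sqrt(xi[1]))
        # IMH acceptance ratio based on the importance weights pi / q_theta.
        log_w_y = log_target(y) - math.log(proposal_pdf(y, xi))
        log_w_x = log_target(x) - math.log(proposal_pdf(x, xi))
        if rng.random() < math.exp(min(0.0, log_w_y - log_w_x)):
            x = y
        # Stochastic approximation update of theta with step size gamma_k.
        gamma = 1.0 / (k + 1)
        theta[0] += gamma * (x - theta[0])
        theta[1] += gamma * (x * x - theta[1])
        samples.append(x)
    return samples, theta
```

For instance, `adaptive_imh(lambda x: -0.5 * (x - 2.0) ** 2, 20000)` targets an unnormalized $\mathcal{N}(2,1)$ density; the fixed component keeps the importance ratio bounded while the adaptive component moves toward the target.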

We now establish intermediate results about $\mathcal{D}_{\epsilon_\zeta,\iota}$ and $\{H_\theta, \theta \in \Theta\}$ which will lead to the proof that (A1)-(A3) are satisfied. We start with a general proposition about the properties of IMH transition probabilities, relevant to checking (A1).

Proposition 19. Let $V : \mathsf{X} \to [1,+\infty)$ and let $q \in \mathcal{Q}_{\epsilon,\pi}$ for some $\epsilon > 0$. Then:

1. $\mathsf{X}$ is a $(1,\epsilon)$-small set with minorization distribution $\varphi = q$, and

$$P_q^{\mathrm{IMH}} V(x) \le (1-\epsilon)V(x) + q(V), \qquad \text{where } q(V) = \int_{\mathsf{X}} q(x)V(x)\lambda^{\mathrm{Leb}}(dx).$$

2. For any $f \in \mathcal{L}_V$ and any proposal distributions $q, q' \in \mathcal{Q}_{\epsilon,\pi}$,

(72) $$\frac{\|P_q^{\mathrm{IMH}} f - P_{q'}^{\mathrm{IMH}} f\|_V}{2\|f\|_V} \le \int_{\mathsf{X}} |q(x) - q'(x)|V(x)\lambda^{\mathrm{Leb}}(dx) + [q(V) \vee q'(V)]\big((1 \wedge |1 - q^{-1}q'|_1) \vee (1 \wedge |1 - (q')^{-1}q|_1)\big).$$

The proof appears in Appendix F. In contrast with the SRWM, the $V$-norm $\|P_q^{\mathrm{IMH}} f - P_{q'}^{\mathrm{IMH}} f\|_V$ can be large, even in situations where $\int_{\mathsf{X}} |q(x) - q'(x)| V(x)\lambda^{\mathrm{Leb}}(dx)$ is small. This stems from the fact that the ratio of densities $q/q'$ enters the upper bound above. As we shall see in Proposition 20 below, this is what motivates our definition of the proposal distributions in (70) as a mixture of $q_\xi \in \mathcal{E}$ and a nonadaptive distribution $\zeta$ which satisfies (E3).

Proposition 20. Assume that the family of distributions $\{q_\xi, \xi \in \Xi\} \subset \mathcal{E}$ satisfies (E1) and (E3). Then, the family of transition kernels $\mathcal{D}_{\epsilon_\zeta,\iota}$ given in (70) satisfies (A1) with $\epsilon = \epsilon_\zeta\iota$, $V = W$, $\varphi = \zeta$ and, if $W$ is bounded, then $\mathsf{C} = \mathsf{X}$, $\lambda = 0$, otherwise choose $\varepsilon \in (0, \epsilon_\zeta\iota)$ such that $\mathsf{C} = \{x : V(x) \le \varepsilon^{-1} \sup_{\theta \in \Theta} q_\theta(V)\}$ is such that $\zeta(\mathsf{C}) > 0$, and set $\lambda = 1 - \epsilon_\zeta\iota + \varepsilon$.

The proof appears in Appendix F. We are now in a position to present our final result:

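Theorem 21. Let $\pi \in \mathcal{M}(\mathsf{X})$ and $\{q_\xi, \xi \in \Xi\} \subset \mathcal{E}$ be a family of distributions. Define $\Theta := T(\mathsf{X}, \mathsf{Z})$. Consider the following families of transition probabilities and functions:

(i) $\{P_\theta, \theta \in \Theta\}$, as in (70), where $\zeta \in \mathcal{Q}_{\epsilon_\zeta,\pi}$ for some $\epsilon_\zeta > 0$, $\{q_\xi, \xi \in \Xi\}$ is further assumed to satisfy (E1), (E3) [with $V$ such that $T \in \mathcal{L}_{V^{\beta/2}}$ for some $\beta \in [0,1)$] and (E2);

(ii) $\{H_\theta, \theta \in \Theta\}$ as in (63).

Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$, $\mathsf{K} \subset \mathsf{X}$ be a compact set and let $\gamma = \{\gamma_k\}$ satisfy (A5). Consider the time-homogeneous Markov chain $\{Z_k\}$ on $\mathcal{Z}$ with transition probability $R$, as defined in Section 2. Then, for any $(x, \theta) \in \mathsf{K} \times \mathcal{K}_0$ and any $\alpha < 1 - \beta$:

1. For any $f \in \mathcal{L}_{V^\alpha}$,

$$n^{-1} \sum_{k=1}^{n} (f(X_k) - \pi(f)) \to 0, \qquad \bar{\mathbb{P}}_\star\text{-a.s.}$$

2. There $\bar{\mathbb{P}}_\star$-a.s. exists a random variable $\theta_\infty \in \{\theta \in \Theta : \nabla_\theta K(\pi \| q_{\hat\xi(\theta)}) = 0\}$ such that, provided that $\sigma(\theta_\infty, f) > 0$ and for any $f \in \mathcal{L}_{V^{\alpha/2}}$,

$$\frac{1}{\sqrt{n}\,\sigma(\theta_\infty, f)} \sum_{k=1}^{n} (f(X_k) - \pi(f)) \to_{\mathcal{D}} \mathcal{N}(0, 1),$$

where $\sigma(\theta, f)$ is given as in (38).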

Proof. The application of Propositions 17, 18 and 20 shows that (A1)-(A3) are satisfied, which, together with (E2) and (A5), implies Theorem 11. Then, we conclude by invoking Theorems 8 and 9. □

Remark 8. It is worth noting that, provided $\pi \in \mathcal{M}(\mathsf{X})$ satisfies (M), the results of Propositions 12, 16, 17 and 18 proved in this paper easily allow one to establish a result similar to Theorem 21 for a generalization of the N-SRWM of [19] (described here in Section 1 and studied in Section 5) to the case where the proposal distribution belongs to $\mathcal{E}$, that is, when the proposal is a mixture of distributions.

APPENDIX A: STABILITY OF THE INHOMOGENEOUS CHAIN 

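Proof of Lemma 5. Under (A1) and (A2), we have, for $(x,\theta) \in \mathsf{X} \times \mathcal{K}$ and $k \ge 1$,

$$\mathbb{E}^\rho_{x,\theta}[V(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}] = \mathbb{E}^\rho_{x,\theta}\big[\mathbb{E}^\rho[V(X_k) \mid \mathcal{F}_{k-1}]\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}\big] \le \lambda\,\mathbb{E}^\rho_{x,\theta}[V(X_{k-1})\mathbb{1}\{\sigma(\mathcal{K}) \ge k-1\}] + b.$$

Now, a straightforward induction leads to

$$\mathbb{E}^\rho_{x,\theta}[V(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}] \le \lambda^k V(x) + \frac{b}{1-\lambda},$$

which shows that there exists a constant $C$ [depending only on $\mathcal{K}$ and the constants appearing in (A1)] such that, for all $k \ge 0$,

(73) $\mathbb{E}^\rho_{x,\theta}[V(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}] \le CV(x)$.

Now, for any $r \in [0, 1]$, by Jensen's inequality, we have, for any $k \ge 0$,

(74) $\mathbb{E}^\rho_{x,\theta}[V^r(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}] \le (\mathbb{E}^\rho_{x,\theta}[V(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}])^r \le C^r V^r(x)$,

showing (21). Similarly, again using Jensen's inequality,

$$\mathbb{E}^\rho_{x,\theta}\Big[\max_{1 \le m \le k} (a_m V(X_m))^r \mathbb{1}\{\sigma(\mathcal{K}) \ge m\}\Big] \le \Big(\mathbb{E}^\rho_{x,\theta}\Big[\max_{1 \le m \le k} a_m V(X_m)\mathbb{1}\{\sigma(\mathcal{K}) \ge m\}\Big]\Big)^r$$

$$\le \Big(\sum_{m=1}^{k} a_m \mathbb{E}^\rho_{x,\theta}[V(X_m)\mathbb{1}\{\sigma(\mathcal{K}) \ge m\}]\Big)^r \le C^r \Big(\sum_{m=1}^{k} a_m\Big)^r V^r(x),$$

showing (22). Finally, since

$$\mathbb{1}\{\sigma(\mathcal{K}) \ge m\} \sum_{k=1}^{m} V^r(X_k) \le \sum_{k=1}^{m} V^r(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\},$$

we have

$$\mathbb{E}^\rho_{x,\theta}\Big[\max_{1 \le m \le n} \mathbb{1}\{\sigma(\mathcal{K}) \ge m\} \sum_{k=1}^{m} V^r(X_k)\Big] \le \mathbb{E}^\rho_{x,\theta}\Big[\sum_{k=1}^{n} V^r(X_k)\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}\Big] \le CnV^r(x),$$

showing (23). □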
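The induction step in the proof above can be checked numerically on the worst-case scalar recursion $v_k = \lambda v_{k-1} + b$. The sketch below is only an illustration; the function names and the numerical values of $\lambda$, $b$ and $v_0$ are assumptions chosen for the example.

```python
def drift_iterates(v0, lam, b, n):
    """Iterate the worst case v_k = lam * v_{k-1} + b of the drift recursion."""
    values = [v0]
    for _ in range(n):
        values.append(lam * values[-1] + b)
    return values


def closed_form_bound(v0, lam, b, k):
    """Bound lam^k * v0 + b / (1 - lam) obtained by induction."""
    return lam ** k * v0 + b / (1.0 - lam)


# Hypothetical constants: any 0 <= lam < 1, b >= 0 and v0 >= 1 would do.
lam, b, v0 = 0.9, 2.0, 5.0
iterates = drift_iterates(v0, lam, b, 200)
# Since v0 >= 1, the constant C = 1 + b / (1 - lam) gives v_k <= C * v0 for all k.
C = 1.0 + b / (1.0 - lam)
```

Every iterate stays below the closed-form bound, and hence below $C v_0$, which is the content of (73) in this scalar setting.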



The following proposition is a direct adaptation of the Birnbaum and Marshall [10] inequality:

Proposition 22. Let $\{S_k, \mathcal{F}_k, k \ge 0\}$ be a submartingale, that is, $\mathbb{E}(S_k \mid \mathcal{F}_{k-1}) \ge S_{k-1}$ a.e. Let $\{a_k > 0, 1 \le k \le n\}$ be a nonincreasing real-valued sequence. If $p \ge 1$ is such that $\mathbb{E}|S_k|^p < \infty$ for $k \in \{1, \ldots, n\}$, then for $m \le n$,

$$\mathbb{P}\Big\{\max_{m \le k \le n} a_k|S_k| \ge 1\Big\} \le \sum_{k=m}^{n-1} (a_k^p - a_{k+1}^p)\mathbb{E}|S_k|^p + a_n^p \mathbb{E}|S_n|^p.$$
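As a sanity check of how a maximal inequality of this type is used, the following sketch compares a Monte Carlo estimate of the left-hand side with the right-hand side for $p = 2$. The setting is an assumption made for the example: $S_k$ is a simple symmetric random walk, so that $\mathbb{E}|S_k|^2 = k$, and the weights are $a_k = 1.5/k$ with $m = 5$ and $n = 50$.

```python
import random


def birnbaum_marshall_bound(a, second_moments, m, n):
    """Right-hand side for p = 2: sum_{k=m}^{n-1} (a_k^2 - a_{k+1}^2) E|S_k|^2 + a_n^2 E|S_n|^2."""
    total = a[n] ** 2 * second_moments[n]
    for k in range(m, n):
        total += (a[k] ** 2 - a[k + 1] ** 2) * second_moments[k]
    return total


def estimate_exceedance(a, m, n, n_paths=20000, seed=1):
    """Monte Carlo estimate of P(max_{m<=k<=n} a_k |S_k| >= 1) for a simple random walk."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        s = 0
        for k in range(1, n + 1):
            s += 1 if rng.random() < 0.5 else -1
            if k >= m and a[k] * abs(s) >= 1.0:
                hits += 1
                break
    return hits / n_paths


n, m = 50, 5
a = [0.0] + [1.5 / k for k in range(1, n + 1)]    # a[0] is unused padding
second_moments = [float(k) for k in range(n + 1)]  # E|S_k|^2 = k for the simple walk
bound = birnbaum_marshall_bound(a, second_moments, m, n)
estimate = estimate_exceedance(a, m, n)
```

The estimated probability falls below the bound, which is itself below one for this choice of weights.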

APPENDIX B: PROOF OF PROPOSITIONS 3 AND 4 

In the sequel, C is a generic constant, which may take different values 
upon each appearance. 

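Proof of Proposition 3. Let $\mathcal{K} \subset \Theta$ be a compact set and let $r \in [0, 1]$. For any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$ and $f \in \mathcal{L}_{V^r}$,

$$P_\theta^n f - P_{\theta'}^n f = \sum_{j=0}^{n-1} P_\theta^j (P_\theta - P_{\theta'}) P_{\theta'}^{n-j-1} f = \sum_{j=0}^{n-1} (P_\theta^j - \pi)(P_\theta - P_{\theta'})(P_{\theta'}^{n-j-1} f - \pi(f)),$$

where we have used the fact that $\pi P_\theta = \pi P_{\theta'} = \pi$ for any $\theta, \theta'$. Theorem 2 shows that there exist a constant $C$ and $\rho \in (0, 1)$ such that, for any $\theta \in \mathcal{K}$, $l \ge 0$ and any $f \in \mathcal{L}_{V^r}$,

(75) $\|P_\theta^l f - \pi(f)\|_{V^r} \le C\|f\|_{V^r}\rho^l$.

Under assumption (A2), for any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$, $j \ge 0$ and any $f \in \mathcal{L}_{V^r}$,

$$\|(P_\theta^j - \pi)(P_\theta - P_{\theta'})(P_{\theta'}^{n-j-1} f - \pi(f))\|_{V^r} \le C\rho^j \|(P_\theta - P_{\theta'})(P_{\theta'}^{n-j-1} f - \pi(f))\|_{V^r}$$

$$\le C\rho^j |\theta - \theta'| \, \|P_{\theta'}^{n-j-1} f - \pi(f)\|_{V^r} \le C|\theta - \theta'| \, \|f\|_{V^r}\rho^n,$$

showing that there exists a constant $C < \infty$ such that, for any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$ and $f \in \mathcal{L}_{V^r}$,

(76) $\|P_\theta^n f - P_{\theta'}^n f\|_{V^r} \le Cn\rho^n|\theta - \theta'| \, \|f\|_{V^r}$.

Now consider $\{f_\theta, \theta \in \Theta\}$, a family of $V^r$-Lipschitz functions. From (75), for any $\theta \in \mathcal{K}$, $\sum_{k=0}^{\infty} |P_\theta^k f_\theta - \pi(f_\theta)| < \infty$ and $\hat f_\theta := \sum_{k=0}^{\infty} (P_\theta^k f_\theta - \pi(f_\theta))$ belongs to $\mathcal{L}_{V^r}$. Now we consider the difference

$$\hat f_\theta - \hat f_{\theta'} = \sum_{k=0}^{\infty} (P_\theta^k f_\theta - \pi(f_\theta)) - \sum_{k=0}^{\infty} (P_{\theta'}^k f_{\theta'} - \pi(f_{\theta'}))$$

$$= \sum_{k=0}^{\infty} (P_\theta^k f_\theta - P_{\theta'}^k f_\theta) - \sum_{k=0}^{\infty} (P_{\theta'}^k (f_{\theta'} - f_\theta) - \pi(f_{\theta'} - f_\theta)),$$

so that, using (75) and (76),

$$\|\hat f_\theta - \hat f_{\theta'}\|_{V^r} \le C|\theta - \theta'| \Big(\sum_{k=0}^{\infty} k\rho^k\Big) \|f_\theta\|_{V^r} + C\Big(\sum_{k=0}^{\infty} \rho^k\Big) \|f_\theta - f_{\theta'}\|_{V^r},$$

and we conclude by using the fact that $\{f_\theta, \theta \in \Theta\}$ is a $V^r$-Lipschitz family of functions. Using the same arguments one can prove a similar bound for $\|P_\theta \hat f_\theta - P_{\theta'} \hat f_{\theta'}\|_{V^r}$. □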

Proof of Proposition 4. For simplicity, we set $\sigma := \sigma(\mathcal{K})$ and, in what follows, $C$ is a finite constant whose value might change upon each appearance. Let $(x, \theta) \in \mathsf{X} \times \Theta$. For $k \ge k_0$, we introduce the following decomposition:

l< e {(/(^)-^(/))l(^> fc )}l 

< \E%{(f(X k ) - P^ n(k) f(X k _ n{k) ))l(a > k)}\ 



By Theorem 2, the last term is bounded by C\\f\\ v i-pV 1 ~ p (x)p n{ ^ < 
Cp~ 1 \\f\\yi-0pkV(x). We consider the first term and use the following new 
decomposition of this bias term (cf. [19]), 

" P £t k) f(Xk-n(k)m(v > k)}\ 



< 



< 



n(k) 

E <eW;\ + J(Xk- ]+ i) - 4 k JiXu-MHp > k — j + 1)} 
i=2 

n(k) 

E Ke{KA p 'C + j(x k ^ +l) - 1% ;.a.y, ,} 

i=2 



x l(cr > fc - j + 1)} 



n(k) 



3=2 

xl(a>k-j + 1)}| 

n(fe)-l 

<C||/||yi^ E JP J Pk- ] KA V ( X k-i) 1 ^>k-j)} 
< C||/llyi-/3U(x)p fc _ n(fc)+ i, 
where we have successively used (76) with $r = 1 - \beta$, (A3), the fact that $\rho$ is assumed nonincreasing and (74). We conclude with the additional condition on $\rho$. □

APPENDIX C: PROOF OF PROPOSITION 12 

For any $x \in \mathsf{X}$, define the acceptance region $A(x) = \{z \in \mathsf{X}; \pi(x + z) \ge \pi(x)\}$ and the rejection region $R(x) = \{z \in \mathsf{X}; \pi(x + z) < \pi(x)\}$. From the definition (47) of $\mathcal{Q}_{a,b}$, ([31], Theorem 2.2) applies for any $q \in \mathcal{Q}_{a,b}$ and we can conclude that (48) is satisfied. Noting that the two sets $A(x)$ and $R(x)$ do not depend on the proposal distribution $q$ and using the conclusion of the proof of Theorem 4.3 of [22], we have

$$\inf_{q \in \mathcal{Q}_{a,b}} \liminf_{|x| \to +\infty} \int_{A(x)} q(z)\lambda^{\mathrm{Leb}}(dz) > 0,$$

so that, from the conclusion of the proof of Theorem 4.1 of [22],

$$\sup_{q \in \mathcal{Q}_{a,b}} \limsup_{|x| \to +\infty} \frac{P_q^{\mathrm{SRW}} V(x)}{V(x)} = 1 - \inf_{q \in \mathcal{Q}_{a,b}} \liminf_{|x| \to +\infty} \int_{A(x)} q(z)\lambda^{\mathrm{Leb}}(dz) < 1,$$

which proves (49). Finally, for any $q \in \mathcal{Q}_{a,b}$,

$$\frac{P_q^{\mathrm{SRW}} V(x)}{V(x)} \le \sup_{0 \le u \le 1} (1 - u + u^{1-\eta}),$$

which proves (50). Now notice that

$$P_q^{\mathrm{SRW}} f(x) - P_{q'}^{\mathrm{SRW}} f(x) = \int_{\mathsf{X}} \alpha(x, x+z)(q(z) - q'(z)) f(x+z)\lambda^{\mathrm{Leb}}(dz) + f(x) \int_{\mathsf{X}} \alpha(x, x+z)(q'(z) - q(z))\lambda^{\mathrm{Leb}}(dz).$$

We therefore focus, for $r \in [0, 1]$ and $f \in \mathcal{L}_{V^r}$, on the term

$$\frac{\big|\int_{\mathsf{X}} \alpha(x, x+z)(q(z) - q'(z)) f(x+z)\lambda^{\mathrm{Leb}}(dz)\big|}{\|f\|_{V^r} V^r(x)} \le \frac{\int_{\mathsf{X}} \alpha(x, x+z)|q(z) - q'(z)| V^r(x+z)\lambda^{\mathrm{Leb}}(dz)}{V^r(x)}$$

$$\le \int_{A(x)} \Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{-r\eta} |q(z) - q'(z)|\lambda^{\mathrm{Leb}}(dz) + \int_{R(x)} \Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{1-r\eta} |q(z) - q'(z)|\lambda^{\mathrm{Leb}}(dz)$$

$$\le \int_{\mathsf{X}} |q(z) - q'(z)|\lambda^{\mathrm{Leb}}(dz).$$

We now conclude that, for any $x \in \mathsf{X}$ and any $f \in \mathcal{L}_{V^r}$,

$$\frac{|P_q^{\mathrm{SRW}} f(x) - P_{q'}^{\mathrm{SRW}} f(x)|}{V^r(x)} \le 2\|f\|_{V^r} \int_{\mathsf{X}} |q(z) - q'(z)|\lambda^{\mathrm{Leb}}(dz).$$

APPENDIX D: PROOF OF LEMMA 13 
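For notational simplicity, we write $q_\Gamma$ for $\mathcal{N}(0, \Gamma)$. We have

$$\int_{\mathsf{X}} |q_\Gamma(z) - q_{\Gamma'}(z)|\lambda^{\mathrm{Leb}}(dz) = \int_{\mathsf{X}} \Big|\int_0^1 \frac{d}{dv} q_{\Gamma + v(\Gamma' - \Gamma)}(z)\lambda^{\mathrm{Leb}}(dv)\Big| \lambda^{\mathrm{Leb}}(dz),$$

and let $\Gamma_v := \Gamma + v(\Gamma' - \Gamma)$, so that

$$\frac{d}{dv} \log q_{\Gamma + v(\Gamma' - \Gamma)}(z) = -\frac{1}{2} \mathrm{Tr}\big[\Gamma_v^{-1}(\Gamma' - \Gamma) - \Gamma_v^{-1} z z^{\mathsf{T}} \Gamma_v^{-1}(\Gamma' - \Gamma)\big]$$

and consequently

$$\int_{\mathsf{X}} |q_\Gamma(z) - q_{\Gamma'}(z)|\lambda^{\mathrm{Leb}}(dz) \le |\Gamma' - \Gamma| \int_0^1 |\Gamma_v^{-1}|\lambda^{\mathrm{Leb}}(dv) \le \frac{n_x}{\lambda_{\min}(\mathcal{K})} |\Gamma' - \Gamma|,$$

where we have used the following inequality:

$$|\mathrm{Tr}[\Gamma_v^{-1} z z^{\mathsf{T}} \Gamma_v^{-1}(\Gamma' - \Gamma)]| \le |\Gamma' - \Gamma| \, \mathrm{Tr}[\Gamma_v^{-1} z z^{\mathsf{T}} \Gamma_v^{-1}].$$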
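The interpolation argument of this appendix can be illustrated numerically in dimension one, where covariance matrices are scalar variances. The sketch below is only an illustration under stated assumptions: it approximates the $L^1$ distance between two centered Gaussian densities by a trapezoidal rule, and compares it with $|\Gamma' - \Gamma| / \min(\Gamma, \Gamma')$, which is the form the bound takes for $n_x = 1$ when the compact set is the segment between the two variances. The function names and grid parameters are hypothetical.

```python
import math


def centered_normal_pdf(z, var):
    """Density of N(0, var) evaluated at z."""
    return math.exp(-0.5 * z * z / var) / math.sqrt(2.0 * math.pi * var)


def l1_distance(var_a, var_b, half_width=40.0, n_points=20001):
    """Trapezoidal approximation of the L1 distance between N(0, var_a) and N(0, var_b)."""
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        z = -half_width + i * step
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * abs(centered_normal_pdf(z, var_a) - centered_normal_pdf(z, var_b))
    return total * step


def one_dimensional_bound(var_a, var_b):
    """|var_b - var_a| divided by the smallest variance on the segment between them."""
    return abs(var_b - var_a) / min(var_a, var_b)


pairs = [(1.0, 1.5), (2.0, 2.1), (0.5, 0.6)]
results = [(l1_distance(a, b), one_dimensional_bound(a, b)) for a, b in pairs]
```

In each case the numerical $L^1$ distance is smaller than the Lipschitz-type bound, and it shrinks with $|\Gamma' - \Gamma|$ as expected.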

APPENDIX E: PROOFS OF PROPOSITIONS 16, 17 AND 18 

Hereafter, for a scalar function $s$, $\nabla s$ is a column vector, and for a (column) vector-valued function $v$ with scalar entries $v_1, v_2, \ldots$, we use the convention that $\nabla v$ is the matrix with $\nabla v_i$ as its $i$th column.

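Proof of Proposition 16. We first note that, from Fisher's identity, we have

$$\forall \xi \in \Xi, \qquad \nabla_\xi \log q_\xi(x) = \int_{\mathsf{Z}} \nabla_\xi \log f_\xi(x,z)\nu_\xi(x,z)\mu(dz) = -\nabla_\xi\psi(\xi) + \nabla_\xi\phi(\xi)\nu_\xi T(x)$$

and, from (E1), we conclude that (65) holds. Equation (66) is a direct consequence of (65). Now we prove (67). With $\Delta\psi := \psi(\xi') - \psi(\xi)$ and $\Delta\phi := \phi(\xi') - \phi(\xi)$ for $\xi, \xi' \in \Xi$,

$$\int_{\mathsf{X}} |q_{\xi'}(x) - q_\xi(x)|W(x)\lambda^{\mathrm{Leb}}(dx) = \int_{\mathsf{X}} \Big|\int_0^1 \int_{\mathsf{Z}} [\Delta\psi - \langle T(x,z), \Delta\phi \rangle] f_{\xi'}^{v}(x,z) f_\xi^{1-v}(x,z)\mu(dz)\lambda^{\mathrm{Leb}}(dv)\Big| W(x)\lambda^{\mathrm{Leb}}(dx)$$

$$\le |\Delta\psi| \int_{\mathsf{X}} \int_0^1 \int_{\mathsf{Z}} f_{\xi'}^{v}(x,z) f_\xi^{1-v}(x,z)W(x)\mu(dz)\lambda^{\mathrm{Leb}}(dv\,dx) + |\Delta\phi| \int_{\mathsf{X}} \int_0^1 \int_{\mathsf{Z}} T(x) f_{\xi'}^{v}(x,z) f_\xi^{1-v}(x,z)W(x)\mu(dz)\lambda^{\mathrm{Leb}}(dv\,dx)$$

$$\le |\Delta\psi| \int_0^1 \Big[\int_{\mathsf{X}} W(x)q_{\xi'}(x)\lambda^{\mathrm{Leb}}(dx)\Big]^{v} \Big[\int_{\mathsf{X}} W(x)q_\xi(x)\lambda^{\mathrm{Leb}}(dx)\Big]^{1-v} \lambda^{\mathrm{Leb}}(dv)$$

$$+ |\Delta\phi| \int_0^1 \Big[\int_{\mathsf{X}} T(x)W(x)q_{\xi'}(x)\lambda^{\mathrm{Leb}}(dx)\Big]^{v} \Big[\int_{\mathsf{X}} T(x)W(x)q_\xi(x)\lambda^{\mathrm{Leb}}(dx)\Big]^{1-v} \lambda^{\mathrm{Leb}}(dv),$$

and we conclude by invoking the assumptions on $W$, $\phi$ and $\psi$. □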

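Proof of Proposition 17. We first note that, from Fisher's identity, we have

$$\forall \xi \in \Xi, \qquad \nabla_\xi \log q_\xi(x) = \int_{\mathsf{Z}} \nabla_\xi \log f_\xi(x,z)\nu_\xi(x,z)\mu(dz) = -\nabla_\xi\psi(\xi) + \nabla_\xi\phi(\xi)\nu_\xi T(x).$$

From (65) and (E1), we may differentiate under the integral sign to show that

$$\nabla_\xi \int_{\mathsf{X}} \pi(x) \log q_\xi(x)\lambda^{\mathrm{Leb}}(dx) = \int_{\mathsf{X}} \pi(x)\nabla_\xi \log q_\xi(x)\lambda^{\mathrm{Leb}}(dx) = -\nabla_\xi\psi(\xi) + \nabla_\xi\phi(\xi)\pi(\nu_\xi T),$$

and thus, by the chain rule,

$$\nabla_\theta w(\theta) = -\nabla_\theta\hat\xi(\theta)\big(-\nabla_\xi\psi(\hat\xi(\theta)) + \nabla_\xi\phi(\hat\xi(\theta))\pi(\nu_{\hat\xi(\theta)} T)\big).$$

For any $\theta \in \Theta$, $\hat\xi(\theta)$ is a stationary point of the mapping $\xi \mapsto L(\theta, \xi)$ and, thus,

$$\nabla_\xi L(\theta, \hat\xi(\theta)) = -\nabla_\xi\psi(\hat\xi(\theta)) + \nabla_\xi\phi(\hat\xi(\theta))\theta = 0.$$

Consequently, (63) implies that $\nabla_\theta w(\theta) = -\nabla_\theta\hat\xi(\theta)\nabla_\xi\phi(\hat\xi(\theta))h(\theta)$. We also notice that $\nabla_\theta\nabla_\xi L(\theta, \xi) = \nabla_\xi\phi(\xi)^{\mathsf{T}}$. Differentiation with respect to $\theta$ of the mapping $\theta \mapsto \nabla_\xi L(\theta, \hat\xi(\theta))$ yields

$$\nabla_\theta\nabla_\xi L(\theta, \hat\xi(\theta)) = \nabla_\xi\phi(\hat\xi(\theta))^{\mathsf{T}} + \nabla_\theta\hat\xi(\theta)\nabla_\xi^2 L(\theta, \hat\xi(\theta)) = 0.$$

We finally have

$$\langle \nabla w(\theta), h(\theta) \rangle = h(\theta)^{\mathsf{T}} \nabla_\theta\hat\xi(\theta)\nabla_\xi^2 L(\theta, \hat\xi(\theta))(\nabla_\theta\hat\xi(\theta))^{\mathsf{T}} h(\theta),$$

which concludes the proof, since, under (E1), $\nabla_\xi^2 L(\theta, \hat\xi(\theta)) \le 0$, for any $\theta \in \Theta$. □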

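Proof of Proposition 18. For any $x \in \mathsf{X}$,

(77) $|H_\theta(x) - H_{\theta'}(x)| \le T(x) \int_{\mathsf{Z}} |\nu_{\hat\xi(\theta)}(x,z) - \nu_{\hat\xi(\theta')}(x,z)|\mu(dz) + |\theta' - \theta|$.

From Proposition 16 one has that, for any compact set $\mathcal{K} \subset \Xi$, there exists a constant $C$ such that, for all $(\xi, z) \in \mathcal{K} \times \mathsf{Z}$,

$$|\nabla_\xi \log f_\xi(x,z)| \le C(1 + T(x)) \quad \text{and} \quad |\nabla_\xi \log q_\xi(x)| \le C(1 + T(x)).$$

Thus,

$$|\nabla_\xi \log \nu_\xi(x,z)| \le |\nabla_\xi \log f_\xi(x,z)| + |\nabla_\xi \log q_\xi(x)| \le 2C(1 + T(x)).$$

Hence, for all $\xi, \xi' \in \mathcal{K}$ and $z \in \mathsf{Z}$,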
|i/ c (x, z) - ^(x, *)| < 2C(1 + T(x))|£ - 
which, together with equation (77), concludes the proof. □ 

APPENDIX F: PROOFS OF PROPOSITIONS 19 AND 20 

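Proof of Proposition 19. The minorization condition is a classical result; see [26]. Now notice that

$$P_q^{\mathrm{IMH}} V(x) = \int_{\mathsf{X}} \alpha_q(x,y)V(y)q(y)\lambda^{\mathrm{Leb}}(dy) + V(x) \int_{\mathsf{X}} [1 - \alpha_q(x,y)]q(y)\lambda^{\mathrm{Leb}}(dy),$$

where $\alpha_q$ is given in (55). The drift condition follows.

From the definition of the transition probability, and for any $f \in \mathcal{L}_V$,

$$|P_q^{\mathrm{IMH}} f(x) - P_{q'}^{\mathrm{IMH}} f(x)| \le \|f\|_V \Big[\int_{\mathsf{X}} |\alpha_q(x,y)q(y) - \alpha_{q'}(x,y)q'(y)|V(y)\lambda^{\mathrm{Leb}}(dy) + V(x) \int_{\mathsf{X}} |\alpha_{q'}(x,y)q'(y) - \alpha_q(x,y)q(y)|\lambda^{\mathrm{Leb}}(dy)\Big]$$

$$\le 2\|f\|_V V(x) \int_{\mathsf{X}} |\alpha_q(x,y)q(y) - \alpha_{q'}(x,y)q'(y)|V(y)\lambda^{\mathrm{Leb}}(dy).$$

We therefore bound

$$I := \int_{\mathsf{X}} \Big|\frac{q(y)}{\pi(y)} \wedge \frac{q(x)}{\pi(x)} - \frac{q'(y)}{\pi(y)} \wedge \frac{q'(x)}{\pi(x)}\Big| \pi(y)V(y)\lambda^{\mathrm{Leb}}(dy).$$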

We introduce the following sets:

$$A_q(x) := \Big\{y : \frac{q(y)}{\pi(y)} \le \frac{q(x)}{\pi(x)}\Big\} \quad \text{and} \quad A_{q'}(x) := \Big\{y : \frac{q'(y)}{\pi(y)} \le \frac{q'(x)}{\pi(x)}\Big\}$$

and note that the following inequalities hold: 



(78) 



VyeA c q ,(x)nA c q (x) %(y) < — 7— rc/(y) A -yr\q'(y) and 
Vy 6 A^s) n B c q (x) 7r(y) < ^tf{v) A q{y))- 



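We now decompose $I$ into four terms $I = \sum_{i=1}^{4} I_i$, where

$$I = \int_{A_q \cap A_{q'}} \Big|\frac{q(y)}{\pi(y)} - \frac{q'(y)}{\pi(y)}\Big| \pi(y)V(y)\lambda^{\mathrm{Leb}}(dy) + \int_{A_q^c \cap A_{q'}^c} \Big|\frac{q(x)}{\pi(x)} - \frac{q'(x)}{\pi(x)}\Big| \pi(y)V(y)\lambda^{\mathrm{Leb}}(dy)$$

$$+ \int_{A_q \cap A_{q'}^c} \Big|\frac{q(y)}{\pi(y)} - \frac{q'(x)}{\pi(x)}\Big| \pi(y)V(y)\lambda^{\mathrm{Leb}}(dy) + \int_{A_q^c \cap A_{q'}} \Big|\frac{q(x)}{\pi(x)} - \frac{q'(y)}{\pi(y)}\Big| \pi(y)V(y)\lambda^{\mathrm{Leb}}(dy),$$

the four integrals being $I_1$, $I_2$, $I_3$ and $I_4$, in this order. Here we have dropped $x$ in the set notation for simplicity. We now determine bounds for $I_i$, $i = 2, 3$. Notice that, since $y \in A_q^c \cap A_{q'}^c$,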



/ 2 < 



1 



g(x) 



y(y) g (y)A Lcb (dy) 



A=nA c 



A 
q(x)



< 



q'(x) 
q'(x) 



A=nA c 



V(y)q'(y)X Leh (dy) 



q(x) 



AsnA c 



A 



1 



q(x) 



q'(x) 

V(y)q(y)\ Lch (dy)V 



A=nA c 



V(y)q'(y)\ Leh (dy) 



and it can easily be checked that 

'm-v 

q q' 



1-1 



1-1 



V < 1 A 



1-1 



The term ^3 can be bounded as follows: 



A 



+ 



q(y)V(y)X Lch (dy) 
q(x) q'{x) 



A g nA= nB^ 



7r(x) 7r(x) / JAqnA^nB 

9(2/) 



and using (78) we find that 



V(yMy)\ Lch (dy) 
V(y)7r(y)\ Lch (dy), 



h < U A 



1 



Ky)^(y)A Leb (dy) 



+ 



A 9 nA=,nB g 



A 5 nA-,nB^ 

'(y)-9(y)l^(y)A Lcb (^). 



4.3 



The bound for $I_4$ follows from that of $I_3$ by swapping $q$ and $q'$. □
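The drift inequality $P_q^{\mathrm{IMH}} V \le (1 - \epsilon)V + q(V)$ established at the start of this proof can be checked exactly on a finite state space, where integrals become sums. The sketch below is a discrete analogue and an assumption made for illustration only: the target `pi`, the proposal `q` and the function `V` are arbitrary choices, and $\epsilon$ is taken as $\min_x q(x)/\pi(x)$.

```python
def imh_apply(pi, q, V):
    """Compute (P V)(x) for the independent MH kernel with target pi and proposal q."""
    n = len(pi)
    out = []
    for x in range(n):
        moved = 0.0      # sum_y alpha(x, y) V(y) q(y)
        accepted = 0.0   # sum_y alpha(x, y) q(y)
        for y in range(n):
            alpha = min(1.0, (pi[y] * q[x]) / (pi[x] * q[y]))
            moved += alpha * V[y] * q[y]
            accepted += alpha * q[y]
        out.append(moved + V[x] * (1.0 - accepted))
    return out


# Hypothetical example: linear target weights, uniform proposal, quadratic V >= 1.
n_states = 6
weights = [float(k) for k in range(1, n_states + 1)]
pi = [w / sum(weights) for w in weights]
q = [1.0 / n_states] * n_states
V = [1.0 + x * x for x in range(n_states)]

eps = min(q[x] / pi[x] for x in range(n_states))
q_of_V = sum(q[x] * V[x] for x in range(n_states))
PV = imh_apply(pi, q, V)
drift_rhs = [(1.0 - eps) * V[x] + q_of_V for x in range(n_states)]
```

Comparing `PV` with `drift_rhs` entry by entry shows the inequality holding at every state, with the constant $\epsilon$ coming solely from the lower bound on $q/\pi$.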

Proof of Proposition 20. The first claim follows directly from Proposition 19 and the assumptions. Now denote

$$r_{\xi,\xi',\iota}(x) := \frac{(1 - \iota)q_\xi(x) + \iota\zeta(x)}{(1 - \iota)q_{\xi'}(x) + \iota\zeta(x)} = 1 + \frac{q_\xi(x) - q_{\xi'}(x)}{\zeta(x)[q_{\xi'}(x)/\zeta(x) + \iota/(1 - \iota)]}.$$

Therefore, from (65), for any convex compact set $\mathcal{K} \subset \Xi$, there exists $C < \infty$ such that, for any $(\xi, \xi', x) \in \mathcal{K}^2 \times \mathsf{X}$,

$$|1 - r_{\xi,\xi',\iota}(x)| \le \frac{1 - \iota}{\iota} \frac{|q_\xi(x) - q_{\xi'}(x)|}{\zeta(x)} \le C|\xi - \xi'| \frac{\sup_{\xi \in \mathcal{K}} q_\xi(x)(1 + T(x))}{\zeta(x)},$$

which, with (71), implies that, for all $\xi, \xi' \in \mathcal{K}$, and for $\lambda^{\mathrm{Leb}}$-almost all $x$, there exists $C < \infty$ such that

$$(1 \wedge |1 - r_{\xi,\xi',\iota}(x)|) \vee (1 \wedge |1 - r_{\xi',\xi,\iota}(x)|) \le C|\xi - \xi'|.$$

Now, as a direct consequence of equation (67), one can show that there exists $C$ such that for any $\xi, \xi' \in \mathcal{K}$ and $r \in [0, 1]$,



The proof is concluded by application of Proposition 19. □ 
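The role of the fixed component in controlling the ratio of proposals can be seen on a small numerical example. The sketch below rests on assumptions chosen for illustration: two unit-variance Gaussian adaptive components with means 0 and 0.1, a fixed component $\zeta = \mathcal{N}(0, 10^2)$, mixing weight $\iota = 0.2$, and a grid on $[-30, 30]$. It compares the deviation from one of the plain density ratio with that of the mixture ratio, and checks the pointwise bound $((1 - \iota)/\iota)\,|q_\xi - q_{\xi'}|/\zeta$ used in the proof.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


iota = 0.2
grid = [-30.0 + 0.1 * i for i in range(601)]

plain_deviation = 0.0     # sup over the grid of |1 - q_a / q_b|
mixture_deviation = 0.0   # sup over the grid of |1 - r(x)| for the mixture ratio r
bound_holds = True
for x in grid:
    q_a = normal_pdf(x, 0.0, 1.0)      # adaptive component with parameter xi
    q_b = normal_pdf(x, 0.1, 1.0)      # adaptive component with parameter xi'
    zeta = normal_pdf(x, 0.0, 100.0)   # fixed heavy component
    r = ((1.0 - iota) * q_a + iota * zeta) / ((1.0 - iota) * q_b + iota * zeta)
    plain_deviation = max(plain_deviation, abs(1.0 - q_a / q_b))
    mixture_deviation = max(mixture_deviation, abs(1.0 - r))
    # Pointwise bound |1 - r| <= ((1 - iota) / iota) * |q_a - q_b| / zeta.
    rhs = (1.0 - iota) / iota * abs(q_a - q_b) / zeta
    bound_holds = bound_holds and abs(1.0 - r) <= rhs * (1.0 + 1e-9) + 1e-15
```

On this grid the plain ratio drifts far from one in the tails, while the mixture ratio stays uniformly close to one, which is the behavior the mixture construction in (70) is designed to secure.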